1. 28 May 2010, 40 commits
    • fs: introduce new truncate sequence · 7bb46a67
      Committed by npiggin@suse.de
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
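      
      As a rough sketch (not code from the patch itself; myfs_setattr is a
      hypothetical filesystem method, using the helpers named above), the
      new sequence looks something like:
      
      	static int myfs_setattr(struct dentry *dentry, struct iattr *attr)
      	{
      		struct inode *inode = dentry->d_inode;
      		int error;
      
      		error = inode_change_ok(inode, attr);
      		if (error)
      			return error;
      
      		if (attr->ia_valid & ATTR_SIZE) {
      			/* fs-specific work (eg. freeing blocks) happens
      			   here, while the old i_size is still visible */
      			error = simple_setsize(inode, attr->ia_size);
      			if (error)
      				return error;
      		}
      
      		generic_setattr(inode, attr);
      		mark_inode_dirty(inode);
      		return 0;
      	}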
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • rename the generic fsync implementations · 1b061d92
      Committed by Christoph Hellwig
      We don't name our generic fsync implementations very well currently.
      The no-op implementation for in-memory filesystems is currently called
      simple_sync_file, which doesn't make much sense to start with, and the
      generic one for simple filesystems is called simple_fsync, which can
      lead to some confusion.
      
      This patch renames the generic file fsync method to generic_file_fsync
      to match the other generic_file_* routines it is supposed to be used
      with, and the no-op implementation to noop_fsync to make it obvious
      what to expect.  In addition add some documentation for both methods.
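      
      A sketch of the intended pairings (the myfs_/myramfs_ names are made
      up):
      
      	/* simple on-disk filesystem: write out the inode's dirty state */
      	const struct file_operations myfs_file_ops = {
      		.fsync	= generic_file_fsync,
      	};
      
      	/* purely in-memory filesystem: there is nothing to write back */
      	const struct file_operations myramfs_file_ops = {
      		.fsync	= noop_fsync,
      	};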
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • drop unused dentry argument to ->fsync · 7ea80859
      Committed by Christoph Hellwig
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of the magic around f_count in aio · d7065da0
      Committed by Al Viro
      __aio_put_req() plays sick games with file refcount.  What
      it wants is fput() from atomic context; it's almost always
      done with f_count > 1, so they only have to deal with delayed
      work in rare cases when their reference happens to be the
      last one.  Current code decrements f_count and if it hasn't
      hit 0, everything is fine.  Otherwise it keeps a pointer
      to struct file (with zero f_count!) around and has delayed
      work do __fput() on it.
      
      Better way to do it: use atomic_long_add_unless( , -1, 1)
      instead of !atomic_long_dec_and_test().  IOW, decrement it
      only if it's not the last reference, leave refcount alone
      if it was.  And use normal fput() in delayed work.
      
      I've made that atomic_long_add_unless call a new helper -
      fput_atomic().  Drops a reference to file if it's safe to
      do in atomic (i.e. if that's not the last one), tells if
      it had been able to do that.  aio.c converted to it, __fput()
      use is gone.  req->ki_file *always* contributes to refcount
      now.  And __fput() became static.
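      
      The helper, as described above, boils down to something like:
      
      	/* drop a reference only if it is not the last one; returns
      	   non-zero if the reference was dropped, 0 if the caller must
      	   use the normal fput() path */
      	int fput_atomic(struct file *file)
      	{
      		return atomic_long_add_unless(&file->f_count, -1, 1);
      	}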
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • hwmon: (asus_atk0110) Don't load if ACPI resources aren't enforced · 70dd6bea
      Committed by Jean Delvare
      When the user passes the kernel parameter acpi_enforce_resources=lax,
      the ACPI resources are no longer protected, so a native driver can
      make use of them. In that case, we do not want the asus_atk0110 to be
      loaded. Unfortunately, this driver loads automatically due to its
      MODULE_DEVICE_TABLE, so the user ends up with two drivers loaded for
      the same device - this is bad.
      
      So I suggest that we prevent the asus_atk0110 driver from loading if
      acpi_enforce_resources=lax.
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      Acked-by: Luca Tettamanti <kronos.it@gmail.com>
      Cc: Len Brown <lenb@kernel.org>
    • numa: introduce numa_mem_id() - effective local memory node id · 7aac7898
      Committed by Lee Schermerhorn
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
      if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
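      
      A sketch of the intended use in an allocation path (illustrative
      only):
      
      	static struct page *alloc_local_page(void)
      	{
      		/* nearest node that actually has memory, rather than a
      		   possibly memoryless numa_node_id() */
      		return alloc_pages_node(numa_mem_id(), GFP_KERNEL, 0);
      	}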
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • numa: add generic percpu var numa_node_id() implementation · 72812019
      Committed by Lee Schermerhorn
      Rework the generic version of the numa_node_id() function to use the new
      generic percpu variable infrastructure.
      
      Guard the new implementation with a new config option:
      
              CONFIG_USE_PERCPU_NUMA_NODE_ID.
      
      Archs which support this new implementation will default this option to 'y'
      when NUMA is configured.  This config option could be removed if/when all
      archs switch over to the generic percpu implementation of numa_node_id().
      Arch support involves:
      
        1) converting any existing per cpu variable implementations to use
           this implementation.  x86_64 is an instance of such an arch.
        2) archs that don't use a per cpu variable for numa_node_id() will
           need to initialize the new per cpu variable "numa_node" as cpus
           are brought on-line.  ia64 is an example.
        3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
           when NUMA is configured.  This is required because I have
           retained the old implementation by default to allow archs to
           be modified incrementally, as desired.
      
      Subsequent patches will convert x86_64 and ia64 to use this implementation.
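      
      The generic percpu-based version is roughly (a sketch, modulo the
      exact guards used by the patch):
      
      	#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
      	DECLARE_PER_CPU(int, numa_node);
      
      	static inline int numa_node_id(void)
      	{
      		return __this_cpu_read(numa_node);
      	}
      	#endif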
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: introduce noop_llseek() · ae6afc3f
      Committed by Jan Blunck
      This is an implementation of ->llseek useable for the rare special case
      when userspace expects the seek to succeed but the (device) file is
      actually not able to perform the seek.  In this case you use noop_llseek()
      instead of falling back to the default implementation of ->llseek.
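      
      Typical usage would be (mydev_fops is a made-up name):
      
      	const struct file_operations mydev_fops = {
      		.owner	= THIS_MODULE,
      		.llseek	= noop_llseek,	/* "succeeds" without moving f_pos */
      	};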
      Signed-off-by: Jan Blunck <jblunck@suse.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • aio: fix the compat vectored operations · 9d85cba7
      Committed by Jeff Moyer
      The aio compat code was not converting the struct iovecs from 32bit to
      64bit pointers, causing either EINVAL to be returned from io_getevents, or
      EFAULT as the result of the I/O.  This patch passes a compat flag to
      io_submit to signal that pointer conversion is necessary for a given iocb
      array.
      
      A variant of this was tested by Michael Tokarev.  I have also updated the
      libaio test harness to exercise this code path with good success.
      Further, I grabbed a copy of ltp and ran the
      testcases/kernel/syscall/readv and writev tests there (compiled with -m32
      on my 64bit system).  All seems happy, but extra eyes on this would be
      welcome.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-by: Michael Tokarev <mjt@tls.msk.ru>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.1]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compat: factor out compat_rw_copy_check_uvector from compat_do_readv_writev · b8373363
      Committed by Jeff Moyer
      It was reported in http://lkml.org/lkml/2010/3/8/309 that 32 bit readv and
      writev AIO operations were not functioning properly.  It turns out that
      the code to convert the 32bit io vectors to 64 bits was never written.
      The results of that can be pretty bad, but in my testing, it mostly ended
      up in generating EFAULT as we walked off the list of I/O vectors provided.
      
      This patch set fixes the problem in my environment.  Comments and
      testing are greatly appreciated.
      
      This patch:
      
      Factor out code that will be used by both compat_do_readv_writev and the
      compat aio submission code paths.
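      
      The core of the conversion being factored out looks roughly like this
      (a sketch, not the exact helper):
      
      	static int copy_compat_iovecs(struct iovec *iov,
      				      const struct compat_iovec __user *uvec32,
      				      unsigned long nr_segs)
      	{
      		unsigned long seg;
      
      		for (seg = 0; seg < nr_segs; seg++) {
      			compat_uptr_t buf;
      			compat_ssize_t len;
      
      			if (get_user(len, &uvec32[seg].iov_len) ||
      			    get_user(buf, &uvec32[seg].iov_base))
      				return -EFAULT;
      			iov[seg].iov_base = compat_ptr(buf);	/* widen */
      			iov[seg].iov_len = len;
      		}
      		return 0;
      	}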
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-by: Michael Tokarev <mjt@tls.msk.ru>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>		[2.6.35.1]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dma-mapping: remove deprecated dma_sync_single and dma_sync_sg API · 99d1bd2c
      Committed by FUJITA Tomonori
      Since 2.6.5, it had been commented, 'for backwards compatibility,
      removed in 2.7.x'. Since 2.6.31, it has been marked as __deprecated.
      
      I think that we can remove the API safely now.
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dma-mapping: remove unnecessary sync_single_range_* in dma_map_ops · 5fd75a78
      Committed by FUJITA Tomonori
      sync_single_range_for_cpu and sync_single_range_for_device hooks are
      unnecessary because sync_single_for_cpu and sync_single_for_device can
      be used instead.
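      
      The redundancy is visible in a sketch like this: a ranged sync is just
      a single sync at an offset, so no separate dma_map_ops hook is needed.
      
      	static void my_sync_range_for_cpu(struct device *dev,
      					  dma_addr_t handle,
      					  unsigned long offset, size_t size)
      	{
      		/* covers what the removed hooks did */
      		dma_sync_single_for_cpu(dev, handle + offset, size,
      					DMA_FROM_DEVICE);
      	}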
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swiotlb: remove unnecessary swiotlb_sync_single_range_* · 38388301
      Committed by FUJITA Tomonori
      swiotlb_sync_single_range_for_cpu and swiotlb_sync_single_range_for_device
      are unnecessary because swiotlb_sync_single_for_cpu and
      swiotlb_sync_single_for_device can be used instead.
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/random32: export pseudo-random number generator for modules · 5960164f
      Committed by Joe Eykholt
      This patch moves the definition of struct rnd_state and the inline
      __seed() function to linux/random.h.  It renames the static __random32()
      function to prandom32() and exports it for use in modules.
      
      prandom32() is useful as a privately-seeded pseudo random number generator
      that can give the same result every time it is initialized.
      
      For FCoE FC-BB-6 VN2VN mode self-selected unique FC address generation, we
      need a pseudo-random number generator seeded with the 64-bit world-wide
      port name.  A truly random generator or one seeded with randomness won't
      do because the same sequence of numbers should be generated each time we
      boot or the link comes up.
      
      A prandom32_seed() inline function is added to the header file.  It is
      inlined not for speed, but so the function won't be expanded in the base
      kernel, but only in the module that uses it.
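      
      Intended usage, as I read it (the function name is made up):
      
      	static u32 first_value_from_wwpn(u64 wwpn)
      	{
      		struct rnd_state state;
      
      		/* same seed => same sequence on every boot/link-up */
      		prandom32_seed(&state, wwpn);
      		return prandom32(&state);
      	}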
      Signed-off-by: Joe Eykholt <jeykholt@cisco.com>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • INIT_SIGHAND: use SIG_DFL instead of NULL · 0a14a130
      Committed by Oleg Nesterov
      Cosmetic, no changes in the compiled code. Just s/NULL/SIG_DFL/ to make
      it more readable and grep-friendly.
      
      Note: probably SIG_IGN makes more sense, we could kill ignore_signals().
      But then kernel_init() should do flush_signal_handlers() before exec().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Mathias Krause <Mathias.Krause@secunet.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pids: init_struct_pid.tasks should never see the swapper process · f2001145
      Committed by Oleg Nesterov
      "statically initialize struct pid for swapper" commit 820e45db says:
      
      	Statically initialize a struct pid for the swapper process (pid_t == 0)
      	and attach it to init_task.  This is needed so task_pid(), task_pgrp()
      	and task_session() interfaces work on the swapper process also.
      
      OK, but:
      
      	- it doesn't make sense to add init_task.pids[].node into
      	  init_struct_pid.tasks[], and in fact this is just wrong.
      
      	  idle threads are special, they shouldn't be visible on any
      	  global list. In particular do_each_pid_task(init_struct_pid)
      	  shouldn't see swapper.
      
      	  This is the actual reason why kill(0, SIGKILL) from /sbin/init
      	  (which starts with 0,0 special pids) crashes the kernel. The
      	  signal sent to pgid/sid == 0 must never see idle threads, even
      	  if the previous patch fixed the crash itself.
      
      	- we have other idle threads running on the non-boot CPUs, see
      	  the next patch.
      
      Change INIT_STRUCT_PID/INIT_PID_LINK to create the empty/unhashed
      hlist_head/hlist_node. Like any other idle thread, the swapper can never exit,
      so detach_pid()->__hlist_del() is not possible, but we could change
      INIT_PID_LINK() to set pprev = &next if needed.
      
      All we need is the valid swapper->pids[].pid == &init_struct_pid.
      Reported-by: Mathias Krause <mathias.krause@secunet.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Mathias Krause <Mathias.Krause@secunet.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • INIT_TASK() should initialize ->thread_group list · fa2755e2
      Committed by Oleg Nesterov
      The trivial /sbin/init doing
      
      	int main(void)
      	{
      		kill(0, SIGKILL);
      	}
      
      crashes the kernel.
      
      This happens because __kill_pgrp_info(init_struct_pid) also sends SIGKILL
      to the swapper process which runs with the uninitialized ->thread_group.
      
      Change INIT_TASK() to initialize ->thread_group properly.
      
      Note: the real problem is that the swapper process must not be visible to
      signals, see the next patch. But this change is right anyway and fixes
      the crash.
      Reported-and-tested-by: Mathias Krause <mathias.krause@secunet.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Mathias Krause <Mathias.Krause@secunet.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pids: increase pid_max based on num_possible_cpus · 72680a19
      Committed by Hedi Berriche
      On a system with a substantial number of processors, the early default
      pid_max of 32k will not be enough.  On a system with 1664 CPUs, 25163
      processes are started before the login prompt.  It's estimated that with
      2048 CPUs we will pass the 32k limit.  With 4096, we'll reach that limit
      very early during the boot cycle, and processes would stall waiting for an
      available pid.
      
      This patch increases the early maximum number of pids available, and
      increases the minimum number of pids that can be set during runtime.
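      
      The sizing logic is along these lines (a sketch; the exact constants
      here are an assumption):
      
      	/* inside pidmap_init(): scale the early limits with the
      	   number of possible cpus */
      	#define PIDS_PER_CPU_DEFAULT	1024	/* assumed value */
      	#define PIDS_PER_CPU_MIN	8	/* assumed value */
      
      	pid_max = min(pid_max_max, max_t(int, pid_max,
      				PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
      	pid_max_min = max_t(int, pid_max_min,
      				PIDS_PER_CPU_MIN * num_possible_cpus());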
      
      [akpm@linux-foundation.org: fix warnings]
      Signed-off-by: Hedi Berriche <hedi@sgi.com>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Robin Holt <holt@sgi.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Greg KH <gregkh@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: John Stoffel <john@stoffel.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio: add switch domain routines · 7a88d628
      Committed by Alexandre Bounine
      Add switch specific domain routines required for 16-bit routing support in
      switches with hierarchical implementation of routing tables.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Thomas Moll <thomas.moll@sysgo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio: modify initialization of switch operations · 058f88d6
      Committed by Alexandre Bounine
      Modify how RapidIO switch operations are declared.  Multiple
      assignments through the linker script are replaced by a single
      initialization call.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: Thomas Moll <thomas.moll@sysgo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio: add enabling SRIO port RX and TX · 933af4a6
      Committed by Thomas Moll
      Add the functionality to enable the input receiver and output transmitter
      of every port, to allow non-maintenance traffic.
      Signed-off-by: Thomas Moll <thomas.moll@sysgo.com>
      Signed-off-by: Alexandre Bounine <abounine@tundra.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio: add Port-Write handling for EM · e5cabeb3
      Committed by Alexandre Bounine
      Add RapidIO Port-Write message handling in the context of the Error
      Management Extensions Specification Rev. 1.3.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Tested-by: Thomas Moll <thomas.moll@sysgo.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio: add IDT CPS/TSI switches · 07590ff0
      Committed by Alexandre Bounine
      Extensions to RapidIO switch support:
      
      1. modify switch route operation declarations to allow using single
         switch-specific file for family of switches that share the same route
         table operations.
      
      2. add standard route table operations for switches that support
         route table manipulation registers as defined in Rev. 1.3 of the
         RapidIO specification.
      
      3. add clear-route-table operation for switches
      
      4. add CPSxx and TSIxxx families of RapidIO switches
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Tested-by: Thomas Moll <thomas.moll@sysgo.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/sem.c: cacheline align the ipc spinlock for semaphores · 31a7c474
      Committed by Manfred Spraul
      Cacheline align the spinlock for sysv semaphores.  Without the patch, the
      spinlock and sem_otime [written by every semop that modified the array]
      and sem_base [read in the hot path of try_atomic_semop()] can be in the
      same cacheline.
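      
      The pattern, roughly (a trimmed sketch of the struct, not the full
      definition):
      
      	struct sem_array {
      		struct kern_ipc_perm	____cacheline_aligned_in_smp
      					sem_perm;	/* holds the ipc lock */
      		time_t			sem_otime;	/* written by every semop */
      		struct sem		*sem_base;	/* read in the hot path */
      		/* remaining fields omitted */
      	};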
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • notifier: change notifier_from_errno(0) to return NOTIFY_OK · b957e043
      Committed by Akinobu Mita
      This changes notifier_from_errno(0) to be NOTIFY_OK instead of
      NOTIFY_STOP_MASK | NOTIFY_OK.
      
      Currently, the notifiers which return encapsulated errno value have to
      do something like this:
      
      	err = do_something(); // returns -errno
      	if (err)
      		return notifier_from_errno(err);
      	else
      		return NOTIFY_OK;
      
      This change makes the above code simpler:
      
      	err = do_something(); // returns -errno
      
      	return notifier_from_errno(err);
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: turn signal_struct->count into "int nr_threads" · b3ac022c
      Committed by Oleg Nesterov
      No functional changes, just s/atomic_t count/int nr_threads/.
      
      With the recent changes this counter has a single user, get_nr_threads().
      And none of its callers need the really accurate number of threads, not
      to mention each caller obviously races with fork/exit.  It is only used to
      report this value to user-space, except first_tid() uses it to avoid
      the unnecessary while_each_thread() loop in the unlikely case.
      
      It is a bit sad we need a word in struct signal_struct for this, perhaps
      we can change get_nr_threads() to approximate the number of threads using
      signal->live and kill ->nr_threads later.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: get_nr_threads() doesn't need ->siglock any longer · 7e49827c
      Committed by Oleg Nesterov
      Now that task->signal can't go away get_nr_threads() doesn't need
      ->siglock to read signal->count.
      
      Also, make it inline, move into sched.h, and convert 2 other proc users of
      signal->count to use this (now trivial) helper.
      
      Henceforth get_nr_threads() is the only valid user of signal->count, we
      are ready to turn it into "int nr_threads" or, perhaps, kill it.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kill the obsolete thread_group_cputime_free() helper · a705be6b
      Committed by Oleg Nesterov
      Kill the empty thread_group_cputime_free() helper.  It was needed to free
      the per-cpu data which we no longer have.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Veaceslav Falico <vfalico@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: kill the awful task_rq_unlock_wait() hack · b7b8ff63
      Committed by Oleg Nesterov
      Now that task->signal can't go away we can revert the horrible hack added
      by ad474cac ("fix for
      account_group_exec_runtime(), make sure ->signal can't be freed under
      rq->lock").
      
      And we can do more cleanups in sched_stats.h/posix-cpu-timers.c later.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: make task_struct->signal immutable/refcountable · ea6d290c
      Committed by Oleg Nesterov
      We have a lot of problems with accessing task_struct->signal, it can
      "disappear" at any moment.  Even current can't use its ->signal safely
      after exit_notify().  ->siglock helps, but it is not convenient, not
      always possible, and sometimes it makes sense to use task->signal even
      after this task is already dead.
      
      This patch adds the reference counter, sigcnt, into signal_struct.  This
      reference is owned by task_struct and it is dropped in
      __put_task_struct().  Perhaps it makes sense to export
      get/put_signal_struct() later, but currently I don't see the immediate
      reason.
      
      Rename __cleanup_signal() to free_signal_struct() and unexport it.  With
      the previous changes it does nothing except kmem_cache_free().
      
      Change __exit_signal() to not clear/free ->signal, it will be freed when
      the last reference to any thread in the thread group goes away.
      
      Note:
      	- when the last thread exits, signal->tty can point to nowhere, see
      	  the next patch.
      
      	- with or without this patch signal_struct->count should go away,
      	  or at least it should be "int nr_threads" for fs/proc. This will
      	  be addressed later.
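      
      The resulting lifetime rule can be sketched as follows
      (free_signal_struct is per this patch; the put helper shown is the
      obvious counterpart, not necessarily the exact code):
      
      	static inline void put_signal_struct(struct signal_struct *sig)
      	{
      		/* called from __put_task_struct(): the signal_struct goes
      		   away with the last task that referenced it */
      		if (atomic_dec_and_test(&sig->sigcnt))
      			free_signal_struct(sig);	/* kmem_cache_free() */
      	}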
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: change zap_other_threads() to count sub-threads · 09faef11
      Committed by Oleg Nesterov
      Change zap_other_threads() to return the number of other sub-threads found
      on ->thread_group list.
      
      Other changes are cosmetic:
      
      	- change the code to use while_each_thread() helper
      
      	- remove the obsolete comment about SIGKILL/SIGSTOP
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • umh: creds: kill subprocess_info->cred logic · c70a626d
      Committed by Oleg Nesterov
      Now that nobody ever changes subprocess_info->cred we can kill this member
      and related code.  ____call_usermodehelper() always runs in the context of
      a freshly forked kernel thread; it has the proper ->cred copied from its
      parent kthread, keventd.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • umh: creds: convert call_usermodehelper_keys() to use subprocess_info->init() · 685bfd2c
      Committed by Oleg Nesterov
      call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change
      subprocess_info->cred in advance.  Now that we have info->init() we can
      change this code to set tgcred->session_keyring in the context of the
      execing kernel thread.
      
      Note: since currently call_usermodehelper_keys() is never called with
      UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup()
      are not really needed, we could rely on install_session_keyring_to_cred()
      which does key_get() on success.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: replace call_usermodehelper_pipe with use of umh init function and resolve limit · 898b374a
      Committed by Neil Horman
      The first patch in this series introduced an init function to the
      call_usermodehelper api so that processes could be customized by the
      caller.  This patch takes advantage of that fact, by customizing the
      helper in do_coredump to create the pipe and set its core limit to one
      (for our recursion check).  This lets us clean up the previous ugliness
      in the usermodehelper internals and factor call_usermodehelper out
      entirely.  While I'm at it, we can also modify the helper setup to look
      for a core limit value of 1 rather than zero for our recursion check.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmod: add init function to usermodehelper · a06a4dc3
      Committed by Neil Horman
      About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
      feature in the kernel works.  We had reports of several races, including
      some reports of apps bypassing our recursion check so that a process that
      was forked as part of a core_pattern setup could infinitely crash and
      refork until the system crashed.
      
      We fixed those by improving our recursion checks.  The new check basically
      refuses to fork a process if its core limit is zero, which works well.
      
      Unfortunately, I've been getting grief from maintainers of user space
      programs that are inserted as the forked process of core_pattern.  They
      contend that in order for their programs (such as abrt and apport) to
      work, all the running processes in a system must have their core limits
      set to a non-zero value, to which I say 'yes'.  I did this by design, and
      think that's the right way to do things.
      
      But I've been asked to ease this burden on user space enough times that I
      thought I would take a look at it.  The first suggestion was to make the
      recursion check fail on a non-zero 'special' number, like one.  That way
      the core collector process could set its core size ulimit to 1, and enable
      the kernel's recursion detection.  This isn't a bad idea on the surface,
      but I don't like it since it's opt-in, in that if a program like abrt or
      apport has a bug and fails to set such a core limit, we're left with a
      recursively crashing system again.
      
      So I've come up with this.  What I've done is modify the
      call_usermodehelper api such that an extra parameter is added, a function
      pointer which will be called by the user helper task, after it forks, but
      before it exec's the required process.  This will give the caller the
      opportunity to get a callback in the process's context, allowing it to do
      whatever it needs to do to the process in the kernel prior to exec-ing the
      user space code.  In the case of do_coredump, this callback is used to set
      the core ulimit of the helper process to 1.  This eliminates the opt-in
      problem that I had above, as it allows the ulimit for core sizes to be set
      to the value of 1, which is what the recursion check looks for in
      do_coredump.
      
      This patch:
      
      Create a new function call_usermodehelper_fns() and allow it to assign
      both an init and cleanup function, as well as arbitrary data.
      
      The init function is called from the context of the forked process and
      allows for customization of the helper process prior to calling exec.  Its
      return code gates the continuation of the process, or causes its exit.
      Also add an arbitrary data pointer to the subprocess_info struct, allowing
      data to be passed from the caller to the new process and the
      subsequent cleanup process.
      
      Also, use this patch to clean up the cleanup function.  It currently takes
      an argp and envp pointer for freeing, which is ugly.  Let's instead just
      make the subprocess_info structure public, and pass that to the cleanup
      and init routines.
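      
      A hypothetical caller might look like this (the my_* names are made
      up; the signature is as described above):
      
      	static int my_init(struct subprocess_info *info)
      	{
      		/* runs in the forked helper before exec; a non-zero
      		   return aborts the exec and exits the helper */
      		return 0;
      	}
      
      	static void my_cleanup(struct subprocess_info *info)
      	{
      		kfree(info->data);	/* caller-supplied data */
      	}
      
      	static int run_my_helper(char *path, char **argv, char **envp,
      				 void *data)
      	{
      		return call_usermodehelper_fns(path, argv, envp,
      					       UMH_WAIT_EXEC, my_init,
      					       my_cleanup, data);
      	}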
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpusets: randomize node rotor used in cpuset_mem_spread_node() · 0ac0c0d0
      Committed by Jack Steiner
      Some workloads that create a large number of small files tend to assign
      too many pages to node 0 (multi-node systems).  Part of the reason is that
      the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
      node 0 for newly created tasks.
      
      This patch changes the rotor to be initialized to a random node number of
      the cpuset.
      
      [akpm@linux-foundation.org: fix layout]
      [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpusets: new round-robin rotor for SLAB allocations · 6adef3eb
      Committed by Jack Steiner
      We have observed several workloads running on multi-node systems where
      memory is assigned unevenly across the nodes in the system.  There are
      numerous reasons for this but one is the round-robin rotor in
      cpuset_mem_spread_node().
      
      For example, a simple test that writes a multi-page file will allocate
      pages on nodes 0 2 4 6 ...  Odd nodes are skipped.  (Sometimes it
      allocates on odd nodes & skips even nodes).
      
      An example is shown below.  The program "lfile" writes a file consisting
      of 10 pages.  The program then mmaps the file & uses get_mempolicy(...,
      MPOL_F_NODE) to determine the nodes where the file pages were allocated.
      The output is shown below:
      
      	# ./lfile
      	 allocated on nodes: 2 4 6 0 1 2 6 0 2
      
      There is a single rotor that is used for allocating both file pages & slab
      pages.  Writing the file allocates both a data page & a slab page
      (buffer_head).  This advances the RR rotor 2 nodes for each page
      allocated.
      
      A quick test seems to confirm this is the cause of the uneven
      allocation:
      
      	# echo 0 >/dev/cpuset/memory_spread_slab
      	# ./lfile
      	 allocated on nodes: 6 7 8 9 0 1 2 3 4 5
      
      This patch introduces a second rotor that is used for slab allocations.
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: make cftype.unregister_event() void-returning · 907860ed
      Committed by Kirill A. Shutemov
      Since we are unable to handle an error returned by
      cftype.unregister_event() properly, let's make the callback
      void-returning.
      
      mem_cgroup_unregister_event() has been rewritten to be a "never fail"
      function.  In mem_cgroup_usage_register_event() we save the old buffer
      for the thresholds array and reuse it in
      mem_cgroup_usage_unregister_event() to avoid allocation.
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix mis-accounting of file mapped racy with migration · ac39cf8c
      Committed by akpm@linux-foundation.org
      FILE_MAPPED per memcg of migrated file cache is not properly updated,
      because our hook in page_add_file_rmap() can't know to which memcg
      FILE_MAPPED should be counted.
      
      Basically, this patch is for fixing the bug but includes some big changes
      to fix up other messes.
      
      Now, when migrating a mapped file, events happen in the following sequence.
      
       1. allocate a new page.
       2. get memcg of an old page.
       3. charge against the new page before migration. But at this point,
          no changes to the new page's page_cgroup, no commit for the charge.
          (IOW, the PCG_USED bit is not set.)
       4. page migration replaces radix-tree, old-page and new-page.
       5. page migration remaps the new page if the old page was mapped.
       6. Here, the new page is unlocked.
       7. memcg commits the charge for the new page, marking the new page's
          page_cgroup as PCG_USED.
      
      Because "commit" happens after page-remap, we can count FILE_MAPPED
      at "5", because we should avoid to trust page_cgroup->mem_cgroup.
      if PCG_USED bit is unset.
      (Note: memcg's LRU removal code does that but LRU-isolation logic is used
       for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
       not on LRU or page_cgroup->mem_cgroup is NULL.)
      
      We can lose file_mapped accounting information at 5 because FILE_MAPPED
      is updated only when mapcount changes 0->1. So we should catch it.
      
      BTW, historically, the above implementation comes from the
      migration-failure handling of anonymous pages. Because we charge both
      the old page and the new page with mapcount=0, we can't catch
        - the page is really freed before remap.
        - migration fails but it's freed before remap
      or other corner cases.
      
      New migration sequence with memcg is:
      
       1. allocate a new page.
       2. mark PageCgroupMigration to the old page.
       3. charge against a new page onto the old page's memcg. (here, new page's pc
          is marked as PageCgroupUsed.)
       4. page migration replaces radix-tree, page table, etc...
       5. At remapping, the new page's page_cgroup is now marked as "USED".
          We can catch the 0->1 event and FILE_MAPPED will be properly updated.
      
          And, after the page is unlocked, the SWAPOUT event from this page
          being freed by unmap() can also be caught.
      
       6. Clear PageCgroupMigration of the old page.
      
      So, FILE_MAPPED will be correctly updated.
      
      Then, what is the MIGRATION flag for?
        Without it, at migration failure, we may have to charge the old page
        again because it may be fully unmapped. "charge" means that we have to
        dive into memory reclaim or something complicated. So, it's better to
        avoid charging it again. Before this patch, __commit_charge() was
        working for both the old and new pages and fixed everything up. But
        this technique had some racy conditions around FILE_MAPPED, SWAPOUT
        etc...
        Now, the kernel uses the MIGRATION flag and doesn't uncharge the old
        page until the end of migration.
      
      I hope this change will make memcg's page migration much simpler.  This
      page migration has caused several troubles.  It is worth adding a flag
      for simplification.
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: move charge of file pages · 87946a72
      Committed by Daisuke Nishimura
      This patch adds support for moving charge of file pages, which include
      normal file, tmpfs file and swaps of tmpfs file.  It's enabled by setting
      bit 1 of <target cgroup>/memory.move_charge_at_immigrate.
      
      Unlike the case of anonymous pages, file pages (and swaps) in the range
      mmapped by the task will be moved even if the task hasn't done a page
      fault, i.e. they might not be the task's "RSS", but other tasks' "RSS"
      that maps the same file.  And the mapcount of the page is ignored (the
      page can be moved even if page_mapcount(page) > 1).  So, the conditions
      that a page/swap must meet to be moved are that it must be in the range
      mmapped by the target task and it must be charged to the old cgroup.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>