1. 10 May, 2007 40 commits
    • wrap access to thread_info · c9f4f06d
      Committed by Roman Zippel
      Recently a few direct accesses to the thread_info in the task structure snuck
      back, so this wraps them with the appropriate wrapper.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Allow arch to initialize arch field of the module structure · e61a1c1c
      Committed by Roman Zippel
      This will later allow an arch to add module specific information via linker
      generated tables instead of poking directly in the module object structure.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • clocksource: fix resume logic · b52f52a0
      Committed by Thomas Gleixner
      We need to make sure that the clocksources are resumed, when timekeeping is
      resumed.  The current resume logic does not guarantee this.
      
      Add a resume function pointer to the clocksource struct, so clocksource
      drivers which need to reinitialize the clocksource can provide a resume
      function.
      
      Add a resume function that calls any clocksource resume functions that are
      provided and resets the watchdog function, so a stable TSC can be used
      across suspend/resume.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
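The optional resume hook described in the clocksource commit above can be sketched in a few lines of Python. This is only an illustration of the pattern (an optional function pointer that a resume-all routine calls when present), not the kernel code: the names `Clocksource`, `resume_all`, `source_a` and `source_b` are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Clocksource:
    """Toy stand-in for a clocksource; the resume hook is optional."""
    name: str
    resume: Optional[Callable[[], None]] = None


def resume_all(sources: List[Clocksource],
               reset_watchdog: Callable[[], None]) -> List[str]:
    """Call every resume hook that was provided, then reset the watchdog.

    Returns the names of the clocksources whose hook was called.
    """
    resumed = []
    for cs in sources:
        if cs.resume is not None:  # drivers without a hook are skipped
            cs.resume()
            resumed.append(cs.name)
    reset_watchdog()
    return resumed


events: List[str] = []
sources = [
    Clocksource("source_a"),  # hypothetical: needs no reinitialization
    Clocksource("source_b", resume=lambda: events.append("source_b reinitialized")),
]
resumed = resume_all(sources, lambda: events.append("watchdog reset"))
```

Only `source_b` is resumed here, and the watchdog reset happens once, after all hooks have run.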
    • Move remote node draining out of slab allocators · 4037d452
      Committed by Christoph Lameter
      Currently the slab allocators contain callbacks into the page allocator to
      perform the draining of pagesets on remote nodes.  This requires SLUB to have
      a whole subsystem in order to be compatible with SLAB.  Moving node draining
      out of the slab allocators avoids a section of code in SLUB.
      
      Move the node draining so that it is done when the vm statistics are updated.
      At that point we are already touching all the cachelines with the pagesets of
      a processor.
      
      Add an expire counter there.  If we have to update per zone or global vm
      statistics then assume that the pageset will require subsequent draining.
      
      The expire counter will be decremented on each vm stats update pass until it
      reaches zero.  Then we will drain one batch from the pageset.  The draining
      will cause vm counter updates which will then cause another expiration until
      the pcp is empty.  So we will drain a batch every 3 seconds.
      
      Note that remote node draining is a somewhat esoteric feature that is required
      on large NUMA systems because otherwise significant portions of system memory
      can become trapped in pcp queues.  The number of pcp is determined by the
      number of processors and nodes in a system.  A system with 4 processors and 2
      nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
      nodes has 512k pcps with a high potential for large amount of memory being
      caught in them.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
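The arithmetic in the remote node draining message can be checked with a small Python sketch. It assumes, as the message's own examples do, that the pcp count is processors times nodes. The expire counter model (`drain_passes`, a starting value of 3, and the batch and pageset sizes) is a simplification chosen for illustration, not the patch's actual code or constants.

```python
def pcp_count(processors: int, nodes: int) -> int:
    """Per-cpu pagesets as counted in the message: processors times nodes."""
    return processors * nodes


def drain_passes(pages: int, batch: int, expire: int) -> int:
    """Count vm stats update passes until a pageset holding `pages` is empty.

    Each pass decrements the expire counter. When it reaches zero, one batch
    is drained; the counter updates caused by that drain re-arm the counter.
    """
    passes = 0
    counter = expire
    while pages > 0:
        passes += 1
        counter -= 1
        if counter == 0:
            pages -= min(batch, pages)
            counter = expire  # draining triggered further vm counter updates
    return passes


small = pcp_count(4, 2)        # the "okay" case from the message
large = pcp_count(1024, 512)   # the large NUMA case from the message
passes = drain_passes(pages=48, batch=16, expire=3)  # hypothetical sizes
```

With one stats pass per second and an expire value of 3 (both assumptions here), the three batches in this example take nine passes, matching the "one batch every 3 seconds" rate in the message.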
    • vmstat: use our own timer events · d1187ed2
      Committed by Christoph Lameter
      vmstat is currently using the cache reaper to periodically bring the
      statistics up to date.  The cache reaper only exists in SLUB as a way to
      provide compatibility with SLAB.  This patch removes the vmstat calls from the
      slab allocators and provides its own handling.
      
      The advantage is also that we can use a different frequency for the updates.
      Refreshing vm stats is a pretty fast job so we can run this every second and
      stagger this by only one tick.  This will lead to some overlap in large
      systems.  For example, a system running at 250 HZ with 1024 processors will
      have about 4 vm updates occurring at once.
      
      However, the vm stats update only accesses per node information.  It is only
      necessary to stagger the vm statistics updates per processor in each node.  Vm
      counter updates occurring on distant nodes will not cause cacheline
      contention.
      
      We could implement an alternate approach that runs the first processor on each
      node at the second and then each of the other processors on a node on a
      subsequent tick.  That may be useful to keep a large amount of the second free
      of timer activity.  Maybe the timer folks will have some feedback on this one?
      
      [jirislaby@gmail.com: add missing break]
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
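The overlap figure in the vmstat message follows from simple counting, sketched below in Python. The model is an assumption: each processor's update is offset by one tick from the previous one and wraps around after HZ ticks. The function name `updates_per_tick` is made up for the sketch.

```python
from collections import Counter


def updates_per_tick(cpus: int, hz: int) -> Counter:
    """Map each tick within a second to the number of CPUs updating on it.

    CPU n is offset by n ticks, wrapping around after `hz` ticks.
    """
    return Counter(cpu % hz for cpu in range(cpus))


per_tick = updates_per_tick(cpus=1024, hz=250)
busiest = max(per_tick.values())
quietest = min(per_tick.values())
```

Because 1024 = 4 * 250 + 24, most ticks carry 4 updates and the first 24 carry 5 under this model, which is consistent with the message's rough figure of 4.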
    • Add suspend-related notifications for CPU hotplug · 8bb78442
      Committed by Rafael J. Wysocki
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: deprecate memclear_highpage_flush · f37bc271
      Committed by Nate Diller
      Now that all the in-tree users are converted over to zero_user_page(),
      deprecate the old memclear_highpage_flush() call.
      Signed-off-by: Nate Diller <nate.diller@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: convert core functions to zero_user_page · 01f2705d
      Committed by Nate Diller
      It's very common for file systems to need to zero part or all of a page;
      the simplest way is just to use kmap_atomic() and memset().  There's
      actually a library function in include/linux/highmem.h that does exactly
      that, but it's confusingly named memclear_highpage_flush(), which is
      descriptive of *how* it does the work rather than what the *purpose* is.
      So this patchset renames the function to zero_user_page(), and calls it
      from the various places that currently open code it.
      
      This first patch introduces the new function call, and converts all the
      core kernel callsites, both the open-coded ones and the old
      memclear_highpage_flush() ones.  Following this patch is a series of
      conversions for each file system individually, per AKPM, and finally a
      patch deprecating the old call.  The diffstat below shows the entire
      patchset.
      
      [akpm@linux-foundation.org: fix a few things]
      Signed-off-by: Nate Diller <nate.diller@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
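As a rough user-space illustration of the operation the zero_user_page commit above names (zeroing part or all of a page), here is a Python sketch that treats a page as a byte buffer. It models only the purpose, not the kernel function's signature or its mapping and cache-flushing steps. The 4096-byte page size and the argument names are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def zero_user_page(page: bytearray, offset: int, size: int) -> None:
    """Zero `size` bytes of `page` starting at `offset` (toy model)."""
    if offset < 0 or size < 0 or offset + size > len(page):
        raise ValueError("range falls outside the page")
    page[offset:offset + size] = bytes(size)


page = bytearray(b"\xff" * PAGE_SIZE)
zero_user_page(page, 512, 1024)  # zero bytes 512..1535, leave the rest alone
```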
    • FUTEX: new PRIVATE futexes · 34f01cc1
      Committed by Eric Dumazet
        Analysis of current linux futex code :
        --------------------------------------
      
      A central hash table futex_queues[] holds all contexts (futex_q) of waiting
      threads.
      
      Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
      perform lookups or insert/deletion of a futex_q.
      
      When a futex_wait() is done, calling thread has to :
      
      1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
           (calling find_vma()). This validation tells us if the futex uses
           an inode based store (mapped file), or mm based store (anonymous mem)
      
      2) - compute a hash key
      
      3) - Atomic increment of reference counter on an inode or a mm_struct
      
      4) - lock part of futex_queues[] hash table
      
      5) - perform the test on value of futex.
      	(rollback if value != expected_value, returns EWOULDBLOCK)
      	(various loops if test triggers mm faults)
      
      6) queue the context into hash table, release the lock got in 4)
      
      7) - release the read_lock on mmap_sem
      
         <block>
      
      8) Eventually unqueue the context (but rarely, as this part  may be done
         by the futex_wake())
      
      Futexes were designed to improve scalability but current implementation has
      various problems :
      
      - Central hashtable :
      
        This means scalability problems if many processes/threads want to use
        futexes at the same time.
        This means NUMA imbalance because this hashtable is located on one node.
      
      - Using mmap_sem on every futex() syscall :
      
        Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
        ops on mmap_sem, dirtying cache line :
          - lot of cache line ping pongs on SMP configurations.
      
        mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
        Highly threaded processes might suffer from mmap_sem contention.
      
        mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
        programs because of contention on the mmap_sem cache line.
      
      - Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
        It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
        because of cache misses.
      
      Most of these scalability problems come from the fact that futexes are in
      one global namespace.  As we use a central hash table, we must make sure
      they are all using the same reference (given by the mm subsystem).  We
      chose to force all futexes be 'shared'.  This has a cost.
      
      But the fact is that POSIX defined PRIVATE and SHARED, allowing clear
      separation, and optimal performance if carefully implemented.  Time has come
      for linux to have better threading performance.
      
      The goal is to permit new futex commands to avoid :
       - Taking the mmap_sem semaphore, conflicting with other subsystems.
       - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.
      
      This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
      futexes, we only need to distinguish futexes by their virtual address, no
      matter what the underlying mm storage is.
      
      If glibc wants to exploit this new infrastructure, it should use new
      _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
      prepared to fallback on old subcommands for old kernels.  Using one global
      variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.
      
      PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.
      
      Compatibility with old applications is preserved, they still hit the
      scalability problems, but new applications can fly :)
      
      Note : the same SHARED futex (mapped on a file) can be used by old binaries
      *and* new binaries, because both binaries will use the old subcommands.
      
      Note : The vast majority of futexes should be using PROCESS_PRIVATE semantics,
      as this is the default.  Almost all applications should benefit
      from these changes (new kernel and updated libc)
      
      Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)
      
      /* calling futex_wait(addr, value) with value != *addr */
      433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
      424 cycles per futex(FUTEX_WAIT) call (using one futex)
      334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
      334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
      For reference :
      187 cycles per getppid() call
      188 cycles per umask() call
      181 cycles per ni_syscall() call
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Pierre Peiffer <pierre.peiffer@bull.net>
      Cc: "Ulrich Drepper" <drepper@gmail.com>
      Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
      Cc: "Ingo Molnar" <mingo@elte.hu>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
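The fallback the PRIVATE futex message suggests for glibc (one global variable holding either FUTEX_PRIVATE_FLAG or 0) might look roughly like the Python sketch below. The numeric constants are the ones from linux/futex.h. Everything else is an assumption made for illustration: the fake `futex` entry point, the `Libc` class, and the premise that an old kernel rejects the unknown subcommand with ENOSYS.

```python
import errno

FUTEX_WAKE = 1
FUTEX_PRIVATE_FLAG = 128


def make_kernel(supports_private: bool):
    """Return a fake futex() entry point (hypothetical, for illustration)."""
    def futex(op: int) -> int:
        if (op & FUTEX_PRIVATE_FLAG) and not supports_private:
            return -errno.ENOSYS  # assumed: old kernel rejects the subcommand
        return 0
    return futex


class Libc:
    """Toy threading library holding one global flag, as the message suggests."""

    def __init__(self, futex):
        self.futex = futex
        self.private_flag = FUTEX_PRIVATE_FLAG  # optimistic default

    def wake(self, process_shared: bool = False) -> int:
        # PTHREAD_PROCESS_SHARED futexes always use the old subcommands.
        flag = 0 if process_shared else self.private_flag
        ret = self.futex(FUTEX_WAKE | flag)
        if ret == -errno.ENOSYS and flag:
            self.private_flag = 0  # fall back for all later calls
            ret = self.futex(FUTEX_WAKE)
        return ret


old = Libc(make_kernel(supports_private=False))
new = Libc(make_kernel(supports_private=True))
old_ret, new_ret = old.wake(), new.wake()
```

On the simulated old kernel the first call pays for one failed attempt and then the flag stays at 0. On the new kernel the private subcommand is used throughout.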
    • futex_requeue_pi optimization · d0aa7a70
      Committed by Pierre Peiffer
      This patch provides the futex_requeue_pi functionality, which allows some
      threads waiting on a normal futex to be requeued on the wait-queue of a
      PI-futex.
      
      This provides an optimization, already used for (normal) futexes, to be used
      with the PI-futexes.
      
      This optimization is currently used by glibc in pthread_cond_broadcast()
      when using "normal" mutexes.  With futex_requeue_pi, it can be used with
      PRIO_INHERIT mutexes too.
      Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Make futex_wait() use an hrtimer for timeout · c19384b5
      Committed by Pierre Peiffer
      This patch modifies futex_wait() to use an hrtimer + schedule() in place of
      schedule_timeout().
      
      schedule_timeout() is tick based, therefore the timeout granularity is the
      tick (1 ms, 4 ms or 10 ms depending on HZ).  By using a high resolution timer
      for timeout wakeup, we can attain a much finer timeout granularity (in the
      microsecond range).  This parallels what is already done for futex_lock_pi().
      
      The timeout passed to the syscall is no longer converted to jiffies.  It is
      instead passed to do_futex() and futex_wait() as an absolute ktime_t, which
      keeps nanosecond resolution.
      
      Also this removes the need to pass the nanoseconds timeout part to
      futex_lock_pi() in val2.
      
      In futex_wait(), if there is no timeout then a regular schedule() is
      performed.  Otherwise, an hrtimer is started before schedule() is called.
      
      [akpm@linux-foundation.org: fix `make headers_check']
      Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
      Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
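The granularity figures in the hrtimer message (1 ms, 4 ms or 10 ms depending on HZ) come straight from the tick length. The Python sketch below reproduces them and shows what rounding to whole ticks does to a short timeout. The rounding rule is a simplified model, and the 150 microsecond request is an arbitrary example, not taken from the patch.

```python
NSEC_PER_SEC = 1_000_000_000


def tick_ns(hz: int) -> int:
    """Length of one tick in nanoseconds."""
    return NSEC_PER_SEC // hz


def tick_based_timeout_ns(timeout_ns: int, hz: int) -> int:
    """Round a timeout up to a whole number of ticks (simplified model)."""
    tick = tick_ns(hz)
    return -(-timeout_ns // tick) * tick  # ceiling division


granularity_ms = {hz: tick_ns(hz) / 1_000_000 for hz in (1000, 250, 100)}
rounded = tick_based_timeout_ns(150_000, hz=250)  # a 150 us request
```

Under this model a 150 microsecond request at HZ=250 cannot expire before a full 4 ms tick, whereas a timeout kept as a nanosecond value has no such floor.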
    • declare struct ktime · f34c506b
      Committed by Andrew Morton
      Some smarty went and inflicted ktime_t as a typedef upon us, so we cannot
      forward declare it.
      
      Create a new `union ktime', map ktime_t onto that.  Now we need to kill off
      this ktime_t thing.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • aio is unlikely · b8522ead
      Committed by Andrew Morton
      Stick an unlikely() around is_aio(): I assert that most IO is synchronous.
      
      Cc: Suparna Bhattacharya <suparna@in.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • RPC: add wrapper for svc_reserve to account for checksum · cd123012
      Committed by Jeff Layton
      When the kernel calls svc_reserve to downsize the expected size of an RPC
      reply, it fails to account for the possibility of a checksum at the end of
      the packet.  If a client mounts an NFSv2/3 export with sec=krb5i/p and does
      I/O, then you'll generally see messages similar to this in the server's ring
      buffer:
      
      RPC request reserved 164 but used 208
      
      While I was never able to verify it, I suspect that this problem is also
      the root cause of some oopses I've seen under these conditions:
      
      https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227726
      
      This is probably also a problem for other sec= types and for NFSv4.  The
      large reserved size for NFSv4 compound packets seems to generally paper
      over the problem, however.
      
      This patch adds a wrapper for svc_reserve that accounts for the possibility
      of a checksum.  It also fixes up the appropriate callers of svc_reserve to
      call the wrapper.  For now, it just uses a hardcoded value that I
      determined via testing.  That value may need to be revised upward as things
      change, or we may want to eventually add a new auth_op that attempts to
      calculate this somehow.
      
      Unfortunately, there doesn't seem to be a good way to reliably determine
      the expected checksum length prior to actually calculating it, particularly
      with schemes like spkm3.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Acked-by: Neil Brown <neilb@suse.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
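The wrapper idea in the svc_reserve message can be sketched in Python as follows. The 164 and 208 byte figures come from the log line quoted in the message. The slack value of 64 bytes, the `Request` class and the wrapper name `reserve_with_checksum` are hypothetical: the message only says that a hardcoded value determined by testing is used, without stating it.

```python
AUTH_SLACK = 64  # hypothetical slack in bytes; the patch's real value is not stated


class Request:
    """Toy RPC request tracking the reserved reply size."""

    def __init__(self, reserved: int, has_checksum: bool):
        self.reserved = reserved
        self.has_checksum = has_checksum


def svc_reserve(rqstp: Request, space: int) -> None:
    """Downsize the reservation to `space` bytes (never grow it)."""
    if space < rqstp.reserved:
        rqstp.reserved = space


def reserve_with_checksum(rqstp: Request, space: int) -> None:
    """Wrapper: leave room for a trailing checksum when one may be appended."""
    extra = AUTH_SLACK if rqstp.has_checksum else 0
    svc_reserve(rqstp, space + extra)


plain = Request(reserved=4096, has_checksum=False)
krb5i = Request(reserved=4096, has_checksum=True)
reserve_with_checksum(plain, 164)
reserve_with_checksum(krb5i, 164)
used = 208  # from the "reserved 164 but used 208" log line
```

With the assumed slack, the integrity-protected request keeps 228 bytes reserved, enough to cover the 208 bytes actually used in the quoted log line.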
    • knfsd: rename sk_defer_lock to sk_lock · 7ac1bea5
      Committed by NeilBrown
      Now that sk_defer_lock protects two different things, make the name more
      generic.
      
      Also don't bother with disabling _bh as the lock is only ever taken from
      process context.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove nfs4_acl_add_ace() · 8842c965
      Committed by Adrian Bunk
      nfs4_acl_add_ace() can now be removed.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Acked-by: Neil Brown <neilb@cse.unsw.edu.au>
      Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • change kernel threads to ignore signals instead of blocking them · 10ab825b
      Committed by Oleg Nesterov
      Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
      signals.  This doesn't prevent the signal delivery, this only blocks
      signal_wake_up().  Every "killall -33 kthreadd" means a "struct siginfo"
      leak.
      
      Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
      them (make a new helper ignore_signals() for that).  If the kernel thread
      needs some signal, it should use allow_signal() anyway, and in that case it
      should not use CLONE_SIGHAND.
      
      Note that we can't change daemonize() (should die!) in the same way,
      because it can be used along with CLONE_SIGHAND.  This means that
      allow_signal() still should unblock the signal to work correctly with
      daemonize()ed threads.
      
      However, disallow_signal() doesn't block the signal any longer but ignores
      it.
      
      NOTE: with or without this patch the kernel threads are not protected from
      handle_stop_signal(), this seems harmless, but not good.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kthread: don't depend on work queues · 73c27992
      Committed by Eric W. Biederman
      Currently there is a circular reference between work queue initialization
      and kthread initialization.  This prevents the kthread infrastructure from
      initializing until after work queues have been initialized.
      
      We want the properties of tasks created with kthread_create to be as close
      as possible to the init_task and to not be contaminated by user processes.
      The later we start our kthreadd that creates these tasks the harder it is
      to avoid contamination from user processes and the more of a mess we have
      to clean up because the defaults have changed on us.
      
      So this patch modifies the kthread support to not use work queues but to
      instead use a simple list of structures, and to have kthreadd start from
      init_task immediately after our kernel thread that execs /sbin/init.
      
      By being a true child of init_task we only have to change those process
      settings that we want to have different from init_task, such as our process
      name, the cpus that are allowed, blocking all signals and setting SIGCHLD
      to SIG_IGN so that all of our children are reaped automatically.
      
      By being a true child of init_task we also naturally get our ppid set to 0
      and do not wind up as a child of PID == 1.  This ensures that tasks generated
      by kthread_create will not slow down the functioning of the wait family of
      functions.
      
      [akpm@linux-foundation.org: use interruptible sleeps]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • unify flush_work/flush_work_keventd and rename it to cancel_work_sync · 28e53bdd
      Committed by Oleg Nesterov
      flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
      (this was possible from the very beginning, I missed this).  So we can unify
      flush_work_keventd and flush_work.
      
      Also, rename flush_work() to cancel_work_sync() and fix all callers.
      Perhaps this is not the best name, but "flush_work" is really bad.
      
      (akpm: this is why the earlier patches bypassed maintainers)
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Auke Kok <auke-jan.h.kok@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • workqueue: kill NOAUTOREL works · 23b2e599
      Committed by Oleg Nesterov
      We don't have any users, and it is not so trivial to use NOAUTOREL works
      correctly.  It is better to simplify API.
      
      Delete NOAUTOREL support and rename work_release to work_clear_pending to
      avoid a confusion.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make cancel_rearming_delayed_work() work on any workqueue, not just keventd_wq · 1634c48f
      Committed by Oleg Nesterov
      cancel_rearming_delayed_workqueue(wq, dwork) doesn't need the first
      parameter.  We don't hang on un-queued dwork any longer, and work->data
      doesn't change its type.  This means we can always figure out "wq" from
      dwork when it is needed.
      
      Remove this parameter, and rename the function to
      cancel_rearming_delayed_work().  Re-create an inline "obsolete"
      cancel_rearming_delayed_workqueue(wq) which just calls
      cancel_rearming_delayed_work().
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • workqueue: kill run_scheduled_work() · 7097a87a
      Committed by Oleg Nesterov
      Because it has no callers.
      
      Actually, I think the whole idea of run_scheduled_work() was not right, not
      good to mix "unqueue this work and execute its ->func()" in one function.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Define and use new events, CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE · baaca49f
      Committed by Gautham R Shenoy
      This is an attempt to provide an alternate mechanism for postponing
      a hotplug event instead of using a global mechanism like lock_cpu_hotplug.
      
      The proposal is to add two new events namely CPU_LOCK_ACQUIRE and
      CPU_LOCK_RELEASE. The notification for these two events would be sent
      out before and after a cpu_hotplug event respectively.
      
      During the CPU_LOCK_ACQUIRE event, a cpu-hotplug-aware subsystem is
      supposed to acquire any per-subsystem hotcpu mutex ( Eg. workqueue_mutex
      in kernel/workqueue.c ).
      
      During the CPU_LOCK_RELEASE event the cpu-hotplug-aware subsystem
      is supposed to release the per-subsystem hotcpu mutex.
      
      The reasons for defining new events as opposed to reusing the existing events
      like CPU_UP_PREPARE/CPU_UP_FAILED/CPU_ONLINE for locking/unlocking of
      per-subsystem hotcpu mutexes are as follows:
      
      	- CPU_LOCK_ACQUIRE: All hotcpu mutexes are taken before subsystems
      	start handling pre-hotplug events like CPU_UP_PREPARE/CPU_DOWN_PREPARE
      	etc, thus ensuring a clean handling of these events.
      
      	- CPU_LOCK_RELEASE: The hotcpu mutexes will be released only after
      	all subsystems have handled post-hotplug events like CPU_DOWN_FAILED,
      	CPU_DEAD, CPU_ONLINE etc thereby ensuring that there are no subsequent
      	clashes amongst the interdependent subsystems after a cpu hotplugs.
      
      This patch also uses __raw_notifier_call_chain in _cpu_up to take care
      of the dependency between the two consecutive calls to
      raw_notifier_call_chain.
      
      [akpm@linux-foundation.org: fix a bug]
      Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
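The locking protocol described in the CPU_LOCK_ACQUIRE/CPU_LOCK_RELEASE commit above can be sketched in Python. The event names are from the message. The numeric values, the `Subsystem` class and the `cpu_up` driver are assumptions made for the sketch and do not mirror the kernel's notifier machinery.

```python
import threading

# Event names from the message; the numeric values are arbitrary here.
CPU_LOCK_ACQUIRE, CPU_UP_PREPARE, CPU_ONLINE, CPU_LOCK_RELEASE = range(4)


class Subsystem:
    """Toy cpu-hotplug-aware subsystem with its own hotcpu mutex."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.log = []

    def cpu_callback(self, event: int, cpu: int) -> None:
        if event == CPU_LOCK_ACQUIRE:
            self.mutex.acquire()
        elif event == CPU_LOCK_RELEASE:
            self.mutex.release()
        else:
            # Pre- and post-hotplug events are handled with the mutex held.
            self.log.append((event, cpu, self.mutex.locked()))


def cpu_up(subsystems, cpu: int) -> None:
    """Send events in the described order: acquire, hotplug events, release."""
    for event in (CPU_LOCK_ACQUIRE, CPU_UP_PREPARE, CPU_ONLINE, CPU_LOCK_RELEASE):
        for s in subsystems:
            s.cpu_callback(event, cpu)


subsystems = [Subsystem(), Subsystem()]
cpu_up(subsystems, cpu=1)
```

Every subsystem takes its mutex before any subsystem sees CPU_UP_PREPARE, and none releases it until all have handled CPU_ONLINE, which is the ordering property the message argues for.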
    • Extend notifier_call_chain to count nr_calls made · 6f7cc11a
      Committed by Gautham R Shenoy
      Since 2.6.18-something, the community has been bugged by the problem to
      provide a clean and a stable mechanism to postpone a cpu-hotplug event as
      lock_cpu_hotplug was badly broken.
      
      This is another proposal towards solving that problem.  This one is along the
      lines of the solution provided in kernel/workqueue.c
      
      Instead of having a global mechanism like lock_cpu_hotplug, we allow the
      subsystems to define their own per-subsystem hot cpu mutexes.  These would be
      taken(released) wherever we are currently calling
      lock_cpu_hotplug(unlock_cpu_hotplug).
      
      Also, in the per-subsystem hotcpu callback function, we take this mutex before
      we handle any pre-cpu-hotplug events and release it once we finish handling
      the post-cpu-hotplug events.  A standard means for doing this has been
      provided in [PATCH 2/4] and demonstrated in [PATCH 3/4].
      
      The ordering of these per-subsystem mutexes might still prove to be a
      problem, but hopefully lockdep should help us get out of that muddle.
      
      The patch set to be applied against linux-2.6.19-rc5 is as follows:
      
      [PATCH 1/4] :	Extend notifier_call_chain with an option to specify the
      		number of notifications to be sent and also count the
      		number of notifications actually sent.
      
      [PATCH 2/4] :	Define events CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE
      		and send out notifications for these in _cpu_up and
      		_cpu_down. This would help us standardise the acquire and
      		release of the subsystem locks in the hotcpu
      		callback functions of these subsystems.
      
      [PATCH 3/4] :	Eliminate lock_cpu_hotplug from kernel/sched.c.
      
      [PATCH 4/4] :	In workqueue_cpu_callback function, acquire(release) the
      		workqueue_mutex while handling
      		CPU_LOCK_ACQUIRE(CPU_LOCK_RELEASE).
      
      If the per-subsystem-locking approach survives the test of time, we can expect
      a slow phasing out of lock_cpu_hotplug, which has not yet been eliminated in
      these patches :)
      
      This patch:
      
      Provide notifier_call_chain with an option to call only a specified number of
      notifiers and also record the number of calls to notifiers made.
      
      The need for this enhancement was identified in the post entitled
      "Slab - Eliminate lock_cpu_hotplug from slab"
      (http://lkml.org/lkml/2006/10/28/92) by Ravikiran G Thirumalai and
      Andrew Morton.
      
      This patch adds two additional parameters to notifier_call_chain API namely
       - int nr_to_calls : Number of notifier_functions to be called.
       		     The don't care value is -1.
      
       - unsigned int *nr_calls : Records the total number of notifier_functions
      			    called by notifier_call_chain. The don't care
      			    value is NULL.
      
      [michal.k.k.piotrowski@gmail.com: build fix]
      Credit: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
      Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
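A Python sketch of the two parameters described in the notifier_call_chain commit above: a limit on how many notifiers to call, with -1 as the "don't care" value, and a count of calls actually made. Python has no out-parameters, so the count is returned alongside the result. The return codes, the event strings and the rollback usage are assumptions for illustration.

```python
NOTIFY_OK = 0
NOTIFY_BAD = 1  # sketch values, not the kernel's


def notifier_call_chain(chain, event, nr_to_call=-1):
    """Call notifiers in order and return (result, nr_calls).

    nr_to_call == -1 means "don't care": every notifier is called. The walk
    stops early when a notifier returns NOTIFY_BAD.
    """
    result, nr_calls = NOTIFY_OK, 0
    for notifier in chain:
        if nr_to_call == 0:
            break
        result = notifier(event)
        nr_calls += 1
        nr_to_call -= 1
        if result == NOTIFY_BAD:
            break
    return result, nr_calls


seen = []


def make_notifier(name, fail_on=None):
    def notifier(event):
        seen.append((name, event))
        return NOTIFY_BAD if event == fail_on else NOTIFY_OK
    return notifier


chain = [make_notifier("a"), make_notifier("b", fail_on="UP_PREPARE"), make_notifier("c")]
result, nr_calls = notifier_call_chain(chain, "UP_PREPARE")
# Roll back only the notifiers that handled UP_PREPARE successfully.
notifier_call_chain(chain, "UP_CANCELED", nr_to_call=nr_calls - 1)
```

Here notifier "b" vetoes the event after two calls, so the cancellation is sent to "a" only. Notifier "c" never saw the original event and is not asked to undo it.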
    • relay: use plain timer instead of delayed work · 7c9cb383
      Committed by Tom Zanussi
      relay doesn't need to use schedule_delayed_work() for waking readers
      when a simple timer will do.
      Signed-off-by: NTom Zanussi <zanussi@comcast.net>
      Cc: Satyam Sharma <satyam.sharma@gmail.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c9cb383
    • A
      kblockd: use flush_work · 19a75d83
      Andrew Morton committed
      Switch the kblockd flushing from a global flush to a more specific
      flush_work().
      
      (akpm: bypassed maintainers, sorry.  There are other patches which depend on
      this)
      
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jens Axboe <axboe@suse.de>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19a75d83
    • O
      implement flush_work() · b89deed3
      Oleg Nesterov committed
      A basic problem with flush_scheduled_work() is that it blocks behind _all_
      presently-queued works, rather than just the work which the caller wants to
      flush.  If the caller holds some lock, and if one of the queued works happens
      to want that lock as well, then accidental deadlocks can occur.
      
      One example of this is the phy layer: it wants to flush work while holding
      rtnl_lock().  But if a linkwatch event happens to be queued, the phy code will
      deadlock because the linkwatch callback function takes rtnl_lock.
      
      So we implement a new function which will flush a *single* work - just the one
      which the caller wants to free up.  Thus we avoid the accidental deadlocks
      which can arise from unrelated subsystems' callbacks taking shared locks.
      
      flush_work() non-blockingly dequeues the work_struct which we want to kill,
      then it waits for its handler to complete on all CPUs.
      
      Add ->current_work to the "struct cpu_workqueue_struct", it points to
      currently running "struct work_struct". When flush_work(work) detects
      ->current_work == work, it inserts a barrier at the _head_ of ->worklist
      (and thus right _after_ that work) and waits for completion. This means
      that the next work fired on that CPU will be this barrier, or another
      barrier queued by concurrent flush_work(), so the caller of flush_work()
      will be woken before any "regular" work has a chance to run.
      
      When wait_on_work() unlocks workqueue_mutex (or whatever we choose to protect
      against CPU hotplug), CPU may go away. But in that case take_over_work() will
      move a barrier we queued to another CPU, it will be fired sometime, and
      wait_on_work() will be woken.
      
      Actually, we are doing cleanup_workqueue_thread()->kthread_stop() before
      take_over_work(), so cwq->thread should complete its ->worklist (and thus
      the barrier), because currently we don't check kthread_should_stop() in
      run_workqueue(). But even if we did, everything should be ok.
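
The barrier placement described above can be modelled in a single thread. This is only a sketch under simplifying assumptions: there is no locking, no per-CPU state and no real waiting, and the names struct work, struct cwq, queue_tail(), start_next() and flush_work_model() are hypothetical, not the kernel's.

```c
#include <stddef.h>

struct work {
    struct work *next;
    int is_barrier;
};

struct cwq {
    struct work *head;          /* models ->worklist */
    struct work *current_work;  /* models ->current_work */
};

/* Append a work item at the tail, as ordinary queueing would. */
static void queue_tail(struct cwq *q, struct work *w)
{
    struct work **pp = &q->head;

    while (*pp)
        pp = &(*pp)->next;
    w->next = NULL;
    *pp = w;
}

/* Insert at the head so the item is the next one to be fired. */
static void queue_head(struct cwq *q, struct work *w)
{
    w->next = q->head;
    q->head = w;
}

/* Take the next item off the list and mark it as currently running. */
static struct work *start_next(struct cwq *q)
{
    struct work *w = q->head;

    if (w) {
        q->head = w->next;
        w->next = NULL;
    }
    q->current_work = w;
    return w;
}

/*
 * Model of the flush decision: dequeue w without blocking if it is still
 * pending; if w is the running work, put a barrier at the head of the list.
 * Returns 1 when the caller would have to wait on the barrier, else 0.
 */
static int flush_work_model(struct cwq *q, struct work *w, struct work *barrier)
{
    struct work **pp;

    for (pp = &q->head; *pp; pp = &(*pp)->next) {
        if (*pp == w) {
            *pp = w->next;
            w->next = NULL;
            break;
        }
    }
    if (q->current_work == w) {
        barrier->is_barrier = 1;
        queue_head(q, barrier);
        return 1;
    }
    return 0;
}
```

Because the barrier goes to the head of the list, it is the next item fired after the running work, so unrelated queued works never sit between the flushed work and the wakeup.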
      
      [akpm@osdl.org: cleanup]
      [akpm@osdl.org: add flush_work_keventd() wrapper]
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b89deed3
    • H
      Use common cpu_is_xxx() macros on AT91 and AVR32 · e7498281
      Haavard Skinnemoen committed
      Several drivers shared between AT91 and AVR32 chips use cpu_is_xxx()
      to handle CPU-specific differences. Currently, such code needs to be
      inside #ifdef CONFIG_ARCH_AT91 because the macros don't exist on AVR32.
      
      By defining the same macros on both AT91 and AVR32, these #ifdefs can
      be eliminated. Since the macros will evaluate to a constant value for
      CPUs that aren't supported by the current architecture, any code that
      is only needed on AT91 will be optimized away on AVR32 and vice versa.
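
A small sketch of the idea, assuming two illustrative macros and two made-up flag values (the macro set and the FLAG_* names are assumptions, not taken from the patch). Because the macro for the other architecture expands to the constant 0, the compiler can drop the guarded branch without any #ifdef in the driver code.

```c
/* Both architectures define the same macro names with constant values. */
#ifdef CONFIG_ARCH_AT91
#define cpu_is_at91rm9200() (1)
#define cpu_is_at32ap7000() (0)
#else
#define cpu_is_at91rm9200() (0)
#define cpu_is_at32ap7000() (1)
#endif

/* Hypothetical driver flags, for illustration only. */
#define FLAG_RM9200_ERRATUM 0x1u
#define FLAG_AP7000_FEATURE 0x2u

/* Shared driver code: no #ifdef needed around the CPU-specific branches. */
static unsigned int probe_flags(void)
{
    unsigned int flags = 0;

    if (cpu_is_at91rm9200())
        flags |= FLAG_RM9200_ERRATUM;   /* dead code when the macro is 0 */
    if (cpu_is_at32ap7000())
        flags |= FLAG_AP7000_FEATURE;
    return flags;
}
```

Built without CONFIG_ARCH_AT91 defined, probe_flags() returns only FLAG_AP7000_FEATURE, and the AT91 branch is a constant-false condition the optimizer can remove.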
      Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: David Brownell <david-b@pacbell.net>
      Acked-by: Andrew Victor <andrew@sanpeople.com>
      Cc: Nicolas Ferre <nicolas.ferre@rfo.atmel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7498281
    • A
      mutex_lock_interruptible(): add __must_check · 18d8362d
      Andrew Morton committed
      It's not sane to use mutex_lock_interruptible() and to then ignore the result.
      
      Ditto down_interruptible(), but I'm lazy.
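
As a rough userspace illustration, the annotation can be modelled with the GCC/Clang warn_unused_result attribute. The toy_mutex type and the toy_* and update_counter() functions are hypothetical stand-ins, and the signal_pending argument is a simplification for this sketch.

```c
#include <errno.h>

/* Assumed mapping for this sketch: GCC and Clang warn if the result is ignored. */
#define __must_check __attribute__((warn_unused_result))

struct toy_mutex {
    int locked;
};

/* Returns 0 on success, or -EINTR if a (simulated) signal interrupts the wait. */
static int __must_check toy_lock_interruptible(struct toy_mutex *m,
                                               int signal_pending)
{
    if (signal_pending)
        return -EINTR;
    m->locked = 1;
    return 0;
}

static void toy_unlock(struct toy_mutex *m)
{
    m->locked = 0;
}

/* Correct caller: the result is checked before touching protected data. */
static int update_counter(struct toy_mutex *m, int *counter, int signal_pending)
{
    int ret = toy_lock_interruptible(m, signal_pending);

    if (ret)
        return ret;   /* lock not taken, so the counter must not be modified */
    (*counter)++;
    toy_unlock(m);
    return 0;
}
```

A caller that wrote toy_lock_interruptible(m, 0); as a bare statement would draw a compiler warning, which is the point of the annotation: an interrupted lock attempt means the lock is not held.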
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18d8362d
    • R
      Move sig_kernel_* et al macros to linux/signal.h · 55c0d1f8
      Roland McGrath committed
      This patch moves the sig_kernel_* and related macros from kernel/signal.c
      to linux/signal.h, and cleans them up slightly.  I need the sig_kernel_*
      macros for default signal behavior in the utrace code, and want to avoid
      duplication or overhead to share the knowledge.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55c0d1f8
    • J
      mca: add integrated device bus matching · 8813d1c0
      James Bottomley committed
      The MCA bus has a few "integrated" functions, which are effectively virtual
      slots on the bus.  The problem is that these special functions don't have
      dedicated pos IDs, so we have to manufacture ids for them outside the pos
      space ...  and these ids can't be matched by the standard matching function,
      so add a special registration that requests a list of pos ids or a particular
      integrated function.
      Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8813d1c0
    • F
      Always ask the hardware to obtain hardware processor id - ia64 · 818563dc
      Fernando Luis Vazquez Cao committed
      Always ask the hardware to determine the hardware processor id in both UP and
      SMP kernels.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      818563dc
    • F
      Use the APIC to determine the hardware processor id - x86_64 · dd988528
      Fernando Luis Vazquez Cao committed
      hard_smp_processor_id used to be just a macro that hard-coded
      hard_smp_processor_id to 0 in the non SMP case.  When booting non SMP kernels
      on hardware where the boot ioapic id is not 0 this turns out to be a problem.
      This happens frequently in the case of kdump and once in a great while in
      the case of real hardware.
      
      Use the APIC to determine the hardware processor id in both UP and SMP kernels
      to fix this issue.
      
      Notice that hard_smp_processor_id is only used by SMP code or by code that
      works with apics so we do not need to handle the case when apics are not
      present and hard_smp_processor_id should never be called there.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Acked-by: Andi Kleen <ak@suse.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd988528
    • F
      Use the APIC to determine the hardware processor id - i386 · a36166c6
      Fernando Luis Vazquez Cao committed
      hard_smp_processor_id used to be just a macro that hard-coded
      hard_smp_processor_id to 0 in the non SMP case.  When booting non SMP kernels
      on hardware where the boot ioapic id is not 0 this turns out to be a problem.
      This happens frequently in the case of kdump and once in a great while in
      the case of real hardware.
      
      Use the APIC to determine the hardware processor id in both UP and SMP kernels
      to fix this issue.
      
      Notice that hard_smp_processor_id is only used by SMP code or by code that
      works with apics so we do not need to handle the case when apics are not
      present and hard_smp_processor_id should never be called there.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Acked-by: Andi Kleen <ak@suse.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a36166c6
    • F
      Remove hardcoding of hard_smp_processor_id on UP systems · 2f4dfe20
      Fernando Luis Vazquez Cao committed
      With the advent of kdump, the assumption that the boot CPU when booting a UP
      kernel is always the CPU with a particular hardware ID (often 0) (usually
      referred to as BSP on some architectures) is not valid anymore.  The reason
      being that the dump capture kernel boots on the crashed CPU (the CPU that
      invoked crash_kexec), which may be or may not be that particular CPU.
      
      Move definition of hard_smp_processor_id for the UP case to
      architecture-specific code ("asm/smp.h") where it belongs, so that each
      architecture can provide its own implementation.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Acked-by: Andi Kleen <ak@suse.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f4dfe20
    • D
      Display all possible partitions when the root filesystem failed to mount · dd2a345f
      Dave Gilbert committed
      Display all possible partitions when the root filesystem is not mounted.
      This helps to track spell'o's and missing drivers.
      
      Updated to work with newer kernels.
      
      Example output:
      
      VFS: Cannot open root device "foobar" or unknown-block(0,0)
      Please append a correct "root=" boot option; here are the available partitions:
      0800    8388608 sda driver: sd
        0801     192748 sda1
        0802    8193150 sda2
      0810    4194304 sdb driver: sd
      Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
      
      [akpm@linux-foundation.org: cleanups, fix printk warnings]
      Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
      Cc: Dave Gilbert <linux@treblig.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd2a345f
    • J
      uml: fix build breakage · 1e0cb0c3
      Jeff Dike committed
      UML now needs required-features.h to build - an empty one suffices.
      Signed-off-by: Jeff Dike <jdike@linux.intel.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e0cb0c3
    • R
      PM: Separate hibernation code from suspend code · a3d25c27
      Rafael J. Wysocki committed
      [ With Johannes Berg <johannes@sipsolutions.net> ]
      
      Separate the hibernation (aka suspend to disk) code from the other suspend
      code.  In particular:
      
       * Remove the definitions related to hibernation from include/linux/pm.h
       * Introduce struct hibernation_ops and a new hibernate() function to hibernate
         the system, defined in include/linux/suspend.h
       * Separate suspend code in kernel/power/main.c from hibernation-related code
         in kernel/power/disk.c and kernel/power/user.c (with the help of
         hibernation_ops)
       * Switch ACPI (the only user of pm_ops.pm_disk_mode) to hibernation_ops
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3d25c27
    • C
      FRV: Replace pgd management via slabs through quicklists · 8defab33
      Christoph Lameter committed
      This is done in order to be able to run SLUB which expects no modifications
      to its page structs.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8defab33
    • S
      Declare {compat_}sys_utimensat · 97416ce8
      Stephen Rothwell committed
      This is needed before Powerpc can wire up the syscall.
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97416ce8