1. 17 5月, 2012 1 次提交
  2. 15 5月, 2012 1 次提交
  3. 08 5月, 2012 5 次提交
  4. 22 3月, 2012 4 次提交
  5. 08 3月, 2012 2 次提交
  6. 06 3月, 2012 6 次提交
    • H
      memcg: fix GPF when cgroup removal races with last exit · 7512102c
      Hugh Dickins 提交于
      When moving tasks from old memcg (with move_charge_at_immigrate on new
      memcg), followed by removal of old memcg, hit General Protection Fault in
      mem_cgroup_lru_del_list() (called from release_pages called from
      free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
      exit_mmap from mmput from exit_mm from do_exit).
      
      Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
      been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
      still trying to update its stats, and take page off lru before freeing.
      
      A task, or a charge, or a page on lru: each secures a memcg against
      removal.  In this case, the last task has been moved out of the old memcg,
      and it is exiting: anonymous pages are uncharged one by one from the
      memcg, as they are zapped from its pagetables, so the charge gets down to
      0; but the pages themselves are queued in an mmu_gather for freeing.
      
      Most of those pages will be on lru (and force_empty is careful to
      lru_add_drain_all, to add pages from pagevec to lru first), but not
      necessarily all: perhaps some have been isolated for page reclaim, perhaps
      some isolated for other reasons.  So, force_empty may find no task, no
      charge and no page on lru, and let the removal proceed.
      
      There would still be no problem if these pages were immediately freed; but
      typically (and the put_page_testzero protocol demands it) they have to be
      added back to lru before they are found freeable, then removed from lru
      and freed.  We don't see the issue when adding, because the
      mem_cgroup_iter() loops keep their own reference to the memcg being
      scanned; but when it comes to mem_cgroup_lru_del_list().
      
      I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
      PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
      of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
      38c5d72f ("memcg: simplify LRU handling by new rule") mercifully
      removed those convolutions, but left this General Protection Fault.
      
      But it's surprisingly easy to restore the old behaviour: just check
      PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
      to add), and reset pc to root_mem_cgroup if page is uncharged.  A risky
      change?  just going back to how it worked before; testing, and an audit of
      uses of pc->mem_cgroup, show no problem.
      
      And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
      sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
      longer has any purpose, and we can safely revert 4e5f01c2 ("memcg:
      clear pc->mem_cgroup if necessary").
      
      Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
      is not strictly necessary: the lru_lock there, with RCU before memcg
      structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
      without that; but it seems cleaner to rely on one dependency less.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7512102c
    • O
      vfork: kill PF_STARTING · 6e27f63e
      Oleg Nesterov 提交于
      Previously it was (ab)used by utrace.  Then it was wrongly used by the
      scheduler code.
      
      Currently it is not used, kill it before it finds the new erroneous user.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e27f63e
    • O
      coredump_wait: don't call complete_vfork_done() · 57b59c4a
      Oleg Nesterov 提交于
      Now that CLONE_VFORK is killable, coredump_wait() no longer needs
      complete_vfork_done().  zap_threads() should find and kill all tasks with
      the same ->mm, this includes our parent if ->vfork_done is set.
      
      mm_release() becomes the only caller, unexport complete_vfork_done().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57b59c4a
    • O
      vfork: make it killable · d68b46fe
      Oleg Nesterov 提交于
      Make vfork() killable.
      
      Change do_fork(CLONE_VFORK) to do wait_for_completion_killable().  If it
      fails we do not return to the user-mode and never touch the memory shared
      with our child.
      
      However, in this case we should clear child->vfork_done before return, we
      use task_lock() in do_fork()->wait_for_vfork_done() and
      complete_vfork_done() to serialize with each other.
      
      Note: now that we use task_lock() we don't really need completion, we
      could turn task->vfork_done into "task_struct *wake_up_me" but this needs
      some complications.
      
      NOTE: this and the next patches do not affect in-kernel users of
      CLONE_VFORK, kernel threads run with all signals ignored including
      SIGKILL/SIGSTOP.
      
      However this is obviously the user-visible change.  Not only a fatal
      signal can kill the vforking parent, a sub-thread can do execve or
      exit_group() and kill the thread sleeping in vfork().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d68b46fe
    • O
      vfork: introduce complete_vfork_done() · c415c3b4
      Oleg Nesterov 提交于
      No functional changes.
      
      Move the clear-and-complete-vfork_done code into the new trivial helper,
      complete_vfork_done().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c415c3b4
    • M
      kmsg_dump: don't run on non-error paths by default · c22ab332
      Matthew Garrett 提交于
      Since commit 04c6862c ("kmsg_dump: add kmsg_dump() calls to the
      reboot, halt, poweroff and emergency_restart paths"), kmsg_dump() gets
      run on normal paths including poweroff and reboot.
      
      This is less than ideal given pstore implementations that can only
      represent single backtraces, since a reboot may overwrite a stored oops
      before it's been picked up by userspace.  In addition, some pstore
      backends may have low performance and provide a significant delay in
      reboot as a result.
      
      This patch adds a printk.always_kmsg_dump kernel parameter (which can also
      be changed from userspace).  Without it, the code will only be run on
      failure paths rather than on normal paths.  The option can be enabled in
      environments where there's a desire to attempt to audit whether or not a
      reboot was cleanly requested or not.
      Signed-off-by: NMatthew Garrett <mjg@redhat.com>
      Acked-by: NSeiji Aguchi <seiji.aguchi@hds.com>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c22ab332
  7. 05 3月, 2012 2 次提交
  8. 03 3月, 2012 5 次提交
  9. 02 3月, 2012 2 次提交
  10. 29 2月, 2012 1 次提交
  11. 27 2月, 2012 2 次提交
    • J
      [PARISC] fix compile break caused by iomap: make IOPORT/PCI mapping functions conditional · 97a29d59
      James Bottomley 提交于
      The problem in
      
      commit fea80311
      Author: Randy Dunlap <rdunlap@xenotime.net>
      Date:   Sun Jul 24 11:39:14 2011 -0700
      
          iomap: make IOPORT/PCI mapping functions conditional
      
      is that if your architecture supplies pci_iomap/pci_iounmap, it expects
      always to supply them.  Adding empty body defitions in the !CONFIG_PCI
      case, which is what this patch does, breaks the parisc compile because
      the functions become doubly defined.  It took us a while to spot this,
      because we don't actually build !CONFIG_PCI very often (only if someone
      is brave enough to test the snake/asp machines).
      
      Since the note in the commit log says this is to fix a
      CONFIG_GENERIC_IOMAP issue (which it does because CONFIG_GENERIC_IOMAP
      supplies pci_iounmap only if CONFIG_PCI is set), there should actually
      have been a condition upon this.  This should make sure no other
      architecture's !CONFIG_PCI compile breaks in the same way as parisc.
      
      The fix had to be updated to take account of the GENERIC_PCI_IOMAP
      separation.
      Reported-by: NRolf Eike Beer <eike@sf-mail.de>
      Signed-off-by: NJames Bottomley <JBottomley@Parallels.com>
      97a29d59
    • L
      Fix autofs compile without CONFIG_COMPAT · 3c761ea0
      Linus Torvalds 提交于
      The autofs compat handling fix caused a compile failure when
      CONFIG_COMPAT isn't defined.
      
      Instead of adding random #ifdef'fery in autofs, let's just make the
      compat helpers earlier to use: without CONFIG_COMPAT, is_compat_task()
      just hardcodes to zero.
      
      We could probably do something similar for a number of other cases where
      we have #ifdef's in code, but this is the low-hanging fruit.
      Reported-and-tested-by: NAndreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c761ea0
  12. 25 2月, 2012 1 次提交
    • O
      epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() · d80e731e
      Oleg Nesterov 提交于
      This patch is intentionally incomplete to simplify the review.
      It ignores ep_unregister_pollwait() which plays with the same wqh.
      See the next change.
      
      epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
      f_op->poll() needs. In particular it assumes that the wait queue
      can't go away until eventpoll_release(). This is not true in case
      of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
      which is not connected to the file.
      
      This patch adds the special event, POLLFREE, currently only for
      epoll. It expects that init_poll_funcptr()'ed hook should do the
      necessary cleanup. Perhaps it should be defined as EPOLLFREE in
      eventpoll.
      
      __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
      ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
      helper.
      
      ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
      This make this poll entry inconsistent, but we don't care. If you
      share epoll fd which contains our sigfd with another process you
      should blame yourself. signalfd is "really special". I simply do
      not know how we can define the "right" semantics if it used with
      epoll.
      
      The main problem is, epoll calls signalfd_poll() once to establish
      the connection with the wait queue, after that signalfd_poll(NULL)
      returns the different/inconsistent results depending on who does
      EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
      has nothing to do with the file, it works with the current thread.
      
      In short: this patch is the hack which tries to fix the symptoms.
      It also assumes that nobody can take tasklist_lock under epoll
      locks, this seems to be true.
      
      Note:
      
      	- we do not have wake_up_all_poll() but wake_up_poll()
      	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.
      
      	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
      	  we need a couple of simple changes in eventpoll.c to
      	  make sure it can't be "lost".
      Reported-by: NMaxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d80e731e
  13. 24 2月, 2012 3 次提交
  14. 22 2月, 2012 5 次提交
    • P
      sched/events: Revert trace_sched_stat_sleeptime() · 8c79a045
      Peter Zijlstra 提交于
      Commit 1ac9bc69 ("sched/tracing: Add a new tracepoint for sleeptime")
      added a new sched:sched_stat_sleeptime tracepoint.
      
      It's broken: the first sample we get on a task might be bad because
      of a stale sleep_start value that wasn't reset at the last task switch
      because the tracepoint was not active.
      
      It also breaks the existing schedstat samples due to the side
      effects of:
      
      -               se->statistics.sleep_start = 0;
      ...
      -               se->statistics.block_start = 0;
      
      Nor do I see means to fix it without adding overhead to the scheduler
      fast path, which I'm not willing to for the sake of redundant
      instrumentation.
      
      Most importantly, sleep time information can already be constructed
      by tracing context switches and wakeups, and taking the timestamp
      difference between the schedule-out, the wakeup and the schedule-in.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/n/tip-pc4c9qhl8q6vg3bs4j6k0rbd@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
      8c79a045
    • L
      sys_poll: fix incorrect type for 'timeout' parameter · faf30900
      Linus Torvalds 提交于
      The 'poll()' system call timeout parameter is supposed to be 'int', not
      'long'.
      
      Now, the reason this matters is that right now 32-bit compat mode is
      broken on at least x86-64, because the 32-bit code just calls
      'sys_poll()' directly on x86-64, and the 32-bit argument will have been
      zero-extended, turning a signed 'int' into a large unsigned 'long'
      value.
      
      We could just introduce a 'compat_sys_poll()' function for this, and
      that may eventually be what we have to do, but since the actual standard
      poll() semantics is *supposed* to be 'int', and since at least on x86-64
      glibc sign-extends the argument before invocing the system call (so
      nobody can actually use a 64-bit timeout value in user space _anyway_,
      even in 64-bit binaries), the simpler solution would seem to be to just
      fix the definition of the system call to match what it should have been
      from the very start.
      
      If it turns out that somebody somehow circumvents the user-level libc
      64-bit sign extension and actually uses a large unsigned 64-bit timeout
      despite that not being how poll() is supposed to work, we will need to
      do the compat_sys_poll() approach.
      Reported-by: NThomas Meyer <thomas@m3y3r.de>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      faf30900
    • H
      asm-generic: architecture independent readq/writeq for 32bit environment · 797a796a
      Hitoshi Mitake 提交于
      This provides unified readq()/writeq() helper functions for 32-bit
      drivers.
      
      For some cases, readq/writeq without atomicity is harmful, and order of
      io access has to be specified explicitly.  So in this patch, new two
      header files which contain non-atomic readq/writeq are added.
      
       - <asm-generic/io-64-nonatomic-lo-hi.h> provides non-atomic readq/
         writeq with the order of lower address -> higher address
      
       - <asm-generic/io-64-nonatomic-hi-lo.h> provides non-atomic readq/
         writeq with reversed order
      
      This allows us to remove some readq()s that were added drivers when the
      default non-atomic ones were removed in commit dbee8a0a ("x86:
      remove 32-bit versions of readq()/writeq()")
      
      The drivers which need readq/writeq but can do with the non-atomic ones
      must add the line:
      
        #include <asm-generic/io-64-nonatomic-lo-hi.h> /* or hi-lo.h */
      
      But this will be nop in 64-bit environments, and no other #ifdefs are
      required.  So I believe that this patch can solve the problem of
       1. driver-specific readq/writeq
       2. atomicity and order of io access
      
      This patch is tested with building allyesconfig and allmodconfig as
      ARCH=x86 and ARCH=i386 on top of tip/master.
      
      Cc: Kashyap Desai <Kashyap.Desai@lsi.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Ravi Anand <ravi.anand@qlogic.com>
      Cc: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
      Cc: Matthew Garrett <mjg@redhat.com>
      Cc: Jason Uhlenkott <juhlenko@akamai.com>
      Cc: James Bottomley <James.Bottomley@parallels.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NHitoshi Mitake <h.mitake@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      797a796a
    • G
      rtnetlink: Fix problem with buffer allocation · 115c9b81
      Greg Rose 提交于
      Implement a new netlink attribute type IFLA_EXT_MASK.  The mask
      is a 32 bit value that can be used to indicate to the kernel that
      certain extended ifinfo values are requested by the user application.
      At this time the only mask value defined is RTEXT_FILTER_VF to
      indicate that the user wants the ifinfo dump to send information
      about the VFs belonging to the interface.
      
      This patch fixes a bug in which certain applications do not have
      large enough buffers to accommodate the extra information returned
      by the kernel with large numbers of SR-IOV virtual functions.
      Those applications will not send the new netlink attribute with
      the interface info dump request netlink messages so they will
      not get unexpectedly large request buffers returned by the kernel.
      
      Modifies the rtnl_calcit function to traverse the list of net
      devices and compute the minimum buffer size that can hold the
      info dumps of all matching devices based upon the filter passed
      in via the new netlink attribute filter mask.  If no filter
      mask is sent then the buffer allocation defaults to NLMSG_GOODSIZE.
      
      With this change it is possible to add yet to be defined netlink
      attributes to the dump request which should make it fairly extensible
      in the future.
      Signed-off-by: NGreg Rose <gregory.v.rose@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      115c9b81
    • M
      percpu: use raw_local_irq_* in _this_cpu op · e920d597
      Ming Lei 提交于
      It doesn't make sense to trace irq off or do irq flags
      lock proving inside 'this_cpu' operations, so replace local_irq_*
      with raw_local_irq_* in 'this_cpu' op.
      
      Also the patch fixes onelockdep warning[1] by the replacement, see
      below:
      
      In commit: 933393f5(percpu:
      Remove irqsafe_cpu_xxx variants), local_irq_save/restore(flags) are
      added inside this_cpu_inc operation, so that trace_hardirqs_off_caller
      will be called by trace_hardirqs_on_caller directly because
      __debug_atomic_inc is implemented as this_cpu_inc, which may trigger
      the lockdep warning[1], for example in the below ARM scenary:
      
      	kernel_thread_helper	/*irq disabled*/
      		->trace_hardirqs_on_caller	/*hardirqs_enabled was set*/
      			->trace_hardirqs_off_caller	/*hardirqs_enabled cleared*/
      				__this_cpu_add(redundant_hardirqs_on)
      			->trace_hardirqs_off_caller	/*irq disabled, so call here*/
      
      The 'unannotated irqs-on' warning will be triggered somewhere because
      irq is just enabled after the irq trace in kernel_thread_helper.
      
      [1],
      [    0.162841] ------------[ cut here ]------------
      [    0.167694] WARNING: at kernel/lockdep.c:3493 check_flags+0xc0/0x1d0()
      [    0.174468] Modules linked in:
      [    0.177703] Backtrace:
      [    0.180328] [<c00171f0>] (dump_backtrace+0x0/0x110) from [<c0412320>] (dump_stack+0x18/0x1c)
      [    0.189086]  r6:c051f778 r5:00000da5 r4:00000000 r3:60000093
      [    0.195007] [<c0412308>] (dump_stack+0x0/0x1c) from [<c00410e8>] (warn_slowpath_common+0x54/0x6c)
      [    0.204223] [<c0041094>] (warn_slowpath_common+0x0/0x6c) from [<c0041124>] (warn_slowpath_null+0x24/0x2c)
      [    0.214111]  r8:00000000 r7:00000000 r6:ee069598 r5:60000013 r4:ee082000
      [    0.220825] r3:00000009
      [    0.223693] [<c0041100>] (warn_slowpath_null+0x0/0x2c) from [<c0088f38>] (check_flags+0xc0/0x1d0)
      [    0.232910] [<c0088e78>] (check_flags+0x0/0x1d0) from [<c008d348>] (lock_acquire+0x4c/0x11c)
      [    0.241668] [<c008d2fc>] (lock_acquire+0x0/0x11c) from [<c0415aa4>] (_raw_spin_lock+0x3c/0x74)
      [    0.250610] [<c0415a68>] (_raw_spin_lock+0x0/0x74) from [<c010a844>] (set_task_comm+0x20/0xc0)
      [    0.259521]  r6:ee069588 r5:ee0691c0 r4:ee082000
      [    0.264404] [<c010a824>] (set_task_comm+0x0/0xc0) from [<c0060780>] (kthreadd+0x28/0x108)
      [    0.272857]  r8:00000000 r7:00000013 r6:c0044a08 r5:ee0691c0 r4:ee082000
      [    0.279571] r3:ee083fe0
      [    0.282470] [<c0060758>] (kthreadd+0x0/0x108) from [<c0044a08>] (do_exit+0x0/0x6dc)
      [    0.290405]  r5:c0060758 r4:00000000
      [    0.294189] ---[ end trace 1b75b31a2719ed1c ]---
      [    0.299041] possible reason: unannotated irqs-on.
      [    0.303955] irq event stamp: 5
      [    0.307159] hardirqs last  enabled at (4): [<c001331c>] no_work_pending+0x8/0x2c
      [    0.314880] hardirqs last disabled at (5): [<c0089b08>] trace_hardirqs_on_caller+0x60/0x26c
      [    0.323547] softirqs last  enabled at (0): [<c003f754>] copy_process+0x33c/0xef4
      [    0.331207] softirqs last disabled at (0): [<  (null)>]   (null)
      [    0.337585] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e920d597