1. 30 Sep, 2016 7 commits
    • P
      sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Peter Zijlstra committed
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough and sometimes just
      does plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration; with
      the intent of finding an idle core (on SMT hardware). The problems
      which this patch tries to address are:
      
       - it's pointless to look for idle cores if the machine is really busy;
         at that point you're just wasting cycles.
      
       - its behaviour is inconsistent between SMT and !SMT hardware in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
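
The order of the three scans can be sketched in user-space C as follows. This is only an illustration under stated assumptions, not the kernel code: the 8-CPU, 2-thread topology, the `idle[]` array and every function name are hypothetical, and the `try_core` / `try_cpu` flags merely stand in for the runtime heuristics that may skip steps 1 and 2.

```c
#include <stdbool.h>

#define NR_CPUS   8   /* hypothetical: one LLC containing 8 CPUs */
#define SMT_WIDTH 2   /* hypothetical: 2 hardware threads per core */

/* First CPU of the core that 'cpu' belongs to. */
static int core_first_cpu(int cpu)
{
    return cpu - (cpu % SMT_WIDTH);
}

/* Step 1: find a core whose threads are all idle; return its first CPU or -1. */
static int scan_idle_core(const bool idle[NR_CPUS])
{
    for (int core = 0; core < NR_CPUS; core += SMT_WIDTH) {
        bool all_idle = true;

        for (int t = 0; t < SMT_WIDTH; t++)
            if (!idle[core + t])
                all_idle = false;
        if (all_idle)
            return core;
    }
    return -1;
}

/* Step 2: find any idle CPU in the LLC; return it or -1. */
static int scan_idle_cpu(const bool idle[NR_CPUS])
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (idle[cpu])
            return cpu;
    return -1;
}

/* Step 3: find an idle thread in the core of 'target'; return it or -1. */
static int scan_idle_smt(const bool idle[NR_CPUS], int target)
{
    int first = core_first_cpu(target);

    for (int t = 0; t < SMT_WIDTH; t++)
        if (idle[first + t])
            return first + t;
    return -1;
}

/* Run the scans in order; steps 1 and 3 only apply when 'smt' is true. */
static int pick_cpu(const bool idle[NR_CPUS], int target,
                    bool smt, bool try_core, bool try_cpu)
{
    int cpu;

    if (smt && try_core && (cpu = scan_idle_core(idle)) >= 0)
        return cpu;
    if (try_cpu && (cpu = scan_idle_cpu(idle)) >= 0)
        return cpu;
    if (smt && (cpu = scan_idle_smt(idle, target)) >= 0)
        return cpu;
    return target;   /* nothing idle was found: stay on the target CPU */
}
```

Step 3 only matters when step 2 was skipped, since a full LLC scan would already have found an idle sibling of the target.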
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guesstimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages; hackbench in particular was sensitive to this.
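
A minimal sketch of the step 2 heuristic, assuming nanosecond averages. The function names, the fudge factor of 512 and the 1/8 weight of the moving average are illustrative assumptions, not values stated in this message.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCAN_FUDGE 512   /* assumed fudge factor, chosen for illustration */

/*
 * Scan only when the expected idle time, scaled down by the fudge
 * factor, still covers the average cost of one scan.
 */
static bool worth_scanning(uint64_t avg_idle_ns, uint64_t avg_scan_cost_ns)
{
    return (avg_idle_ns / SCAN_FUDGE) >= avg_scan_cost_ns;
}

/* Move the running average 1/8 of the way towards the new sample. */
static uint64_t update_avg(uint64_t avg, uint64_t sample)
{
    int64_t delta = (int64_t)(sample - avg) / 8;

    return avg + (uint64_t)delta;
}
```

A caller would measure the duration of each scan, feed it to `update_avg()`, and consult `worth_scanning()` before starting the next one.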
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in cpu id order. This avoids multiple
      CPUs in the LLC scanning for idle to gang up and find the same CPU
      quite as much. The down side is that tasks can end up hopping across
      the LLC for no apparent reason.
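
The wrap-around scan order described above can be sketched like this. The function name and the plain `idle[]` array are hypothetical; the kernel iterates a cpumask instead.

```c
#include <stdbool.h>

/*
 * Return the first idle CPU found when scanning nr_cpus CPUs starting
 * at 'target' and wrapping around, or -1 if none is idle.
 */
static int first_idle_from(const bool *idle, int nr_cpus, int target)
{
    for (int i = 0; i < nr_cpus; i++) {
        int cpu = (target + i) % nr_cpus;

        if (idle[cpu])
            return cpu;
    }
    return -1;
}
```

With CPUs 0 and 3 idle out of six, a waker targeting CPU 1 picks CPU 3 while a waker targeting CPU 4 picks CPU 0; scanning in CPU id order would send both to CPU 0.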
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Peter Zijlstra committed
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      sched/core: Introduce 'struct sched_domain_shared' · 24fc7edb
      Peter Zijlstra committed
      Since struct sched_domain is strictly per cpu; introduce a structure
      that is shared between all 'identical' sched_domains.
      
      Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
      for shared cache state; if another use comes up later we can easily
      relax this.
      
      While the sched_group's are normally shared between CPUs, these are
      not natural to use when we need some shared state on a domain level --
      since that would require the domain to have a parent, which is not a
      given.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • O
      sched/wait: Introduce init_wait_entry() · 0176beaf
      Oleg Nesterov committed
      The partial initialization of wait_queue_t in prepare_to_wait_event() looks
      ugly. This was done to shrink .text, but we can simply add the new helper
      which does the full initialization and shrink the compiled code a bit more.
      
      Also, this way prepare_to_wait_event() can have more users. In particular we
      are ready to remove the signal_pending_state() checks from wait_bit_action_f
      helpers and change __wait_on_bit_lock() to use prepare_to_wait_event().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140055.GA6167@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • O
      sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock() · eaf9ef52
      Oleg Nesterov committed
      __wait_on_bit_lock() doesn't need abort_exclusive_wait() either. Right
      now it can't use prepare_to_wait_event() (see the next change), but
      it can do the additional finish_wait() if action() fails.
      
      abort_exclusive_wait() no longer has callers, remove it.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140053.GA6164@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • O
      sched/wait: Avoid abort_exclusive_wait() in ___wait_event() · b1ea06a9
      Oleg Nesterov committed
      ___wait_event() doesn't really need abort_exclusive_wait(), we can simply
      change prepare_to_wait_event() to remove the waiter from q->task_list if
      it was interrupted.
      
      This simplifies the code/logic, and this way prepare_to_wait_event() can
      have more users, see the next change.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160908164815.GA18801@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      --
       include/linux/wait.h |    7 +------
       kernel/sched/wait.c  |   35 +++++++++++++++++++++++++----------
       2 files changed, 26 insertions(+), 16 deletions(-)
    • O
      sched/wait: Fix abort_exclusive_wait(), it should pass TASK_NORMAL to wake_up() · 38a3e1fc
      Oleg Nesterov committed
      Otherwise this logic only works if mode is "compatible" with another
      exclusive waiter.
      
      If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
      abort_exclusive_wait() won't wake an uninterruptible waiter.
      
      The main user is __wait_on_bit_lock() and currently it is fine but only
      because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
      lock_page_interruptible() yet.
      
      Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
      Yes, this means that (say) wake_up_interruptible() can wake up the non-
      interruptible waiter(s), but I think this is fine. And in fact I think
      that abort_exclusive_wait() must die, see the next change.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 29 Sep, 2016 1 commit
  3. 26 Sep, 2016 1 commit
  4. 22 Sep, 2016 2 commits
  5. 21 Sep, 2016 2 commits
    • N
      vti6: fix input path · 63c43787
      Nicolas Dichtel committed
      Since commit 1625f452, vti6 is broken, all input packets are dropped
      (LINUX_MIB_XFRMINNOSTATES is incremented).
      
      XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 is set by vti6_rcv() before calling
      xfrm6_rcv()/xfrm6_rcv_spi(), thus we cannot set to NULL that value in
      xfrm6_rcv_spi().
      
      A new function, xfrm6_rcv_tnl(), which allows passing a value to
      xfrm6_rcv_spi(), is added so that xfrm6_rcv() is not touched (that function
      is used in several handlers).
      
      CC: Alexey Kodanev <alexey.kodanev@oracle.com>
      Fixes: 1625f452 ("net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_key")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • A
      fix fault_in_multipages_...() on architectures with no-op access_ok() · e23d4159
      Al Viro committed
      Switching iov_iter fault-in to multipages variants has exposed an old
      bug in underlying fault_in_multipages_...(); they break if the range
      passed to them wraps around.  Normally access_ok() done by callers will
      prevent such (and it's a guaranteed EFAULT - ERR_PTR() values fall into
      such a range and they should not point to any valid objects).
      
      However, on architectures where userland and kernel live in different
      MMU contexts (e.g. s390) access_ok() is a no-op and on those a range
      with a wraparound can reach fault_in_multipages_...().
      
      Since any wraparound means EFAULT there, the fix is trivial - turn
      those
      
          while (uaddr <= end)
      	    ...
      into
      
          if (unlikely(uaddr > end))
      	    return -EFAULT;
          do
      	    ...
          while (uaddr <= end);
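
The effect of the new check can be reproduced in user space with unsigned arithmetic. This is a sketch under stated assumptions: `fault_in_range()`, `probe()`, `probe_count` and `SKETCH_PAGE_SIZE` are hypothetical stand-ins, and the loop step is written so that the increment itself cannot overflow, which is an addition of this sketch rather than part of the fix.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

static int probe_count;          /* number of pages touched so far */

/* Stand-in for touching one user page; it always succeeds here. */
static int probe(uintptr_t addr)
{
    (void)addr;
    probe_count++;
    return 0;
}

/* Touch one address per page in [uaddr, uaddr + size - 1]. */
static int fault_in_range(uintptr_t uaddr, uintptr_t size)
{
    uintptr_t end = uaddr + size - 1;   /* unsigned, so it may wrap */

    if (size == 0)
        return 0;
    if (uaddr > end)                    /* the range wrapped around */
        return -EFAULT;

    for (;;) {
        int ret = probe(uaddr);

        if (ret)
            return ret;
        if (end - uaddr < SKETCH_PAGE_SIZE)   /* next step would pass 'end' */
            break;
        uaddr += SKETCH_PAGE_SIZE;
    }
    return 0;
}
```

With the old `while (uaddr <= end)` form, a wrapped range would skip the loop entirely and report success without touching anything.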
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Cc: stable@vger.kernel.org # v3.5+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 20 Sep, 2016 2 commits
  7. 18 Sep, 2016 1 commit
  8. 17 Sep, 2016 2 commits
  9. 14 Sep, 2016 2 commits
  10. 13 Sep, 2016 2 commits
  11. 10 Sep, 2016 2 commits
  12. 09 Sep, 2016 1 commit
  13. 08 Sep, 2016 1 commit
  14. 07 Sep, 2016 2 commits
  15. 05 Sep, 2016 6 commits
    • J
      efi/libstub: Introduce ExitBootServices helper · fc07716b
      Jeffrey Hugo committed
      The spec allows ExitBootServices to fail with EFI_INVALID_PARAMETER if a
      race condition has occurred where the EFI has updated the memory map after
      the stub grabbed a reference to the map.  The spec defines a retry
      procedure with specific requirements to handle this scenario.
      
      This scenario was previously observed on x86 - commit d3768d88 ("x86,
      efi: retry ExitBootServices() on failure") but the current fix is not spec
      compliant and the scenario is now observed on the Qualcomm Technologies
      QDF2432 via the FDT stub which does not handle the error and thus causes
      boot failures.  The user will notice the boot failure as the kernel is not
      executed and the system may drop back to a UEFI shell, but will be
      unresponsive to input and the system will require a power cycle to recover.
      
      Add a helper to the stub library that correctly adheres to the spec in the
      case of EFI_INVALID_PARAMETER from ExitBootServices and can be universally
      used across all stub implementations.
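
The retry flow can be sketched against a mock firmware. Everything here is hypothetical: the `struct firmware` mock, the status values and the function names are not the libstub API. The sketch also assumes that after a failed exit only the memory map is fetched again before the single retry.

```c
typedef int status_t;

enum { ST_SUCCESS = 0, ST_INVALID_PARAMETER = 2 };   /* mock status codes */

/* Mock firmware: the map key changes whenever the memory map changes. */
struct firmware {
    unsigned long current_key;
    int map_changes_left;   /* how often the map still changes behind our back */
};

/* Mock of fetching the memory map: returns the key of the current map. */
static unsigned long get_memory_map(struct firmware *fw)
{
    return fw->current_key;
}

/* Mock of the exit call: fails when the caller's key is stale. */
static status_t exit_boot_services(struct firmware *fw, unsigned long key)
{
    if (fw->map_changes_left > 0) {
        fw->map_changes_left--;
        fw->current_key++;          /* simulate the race with the firmware */
    }
    return key == fw->current_key ? ST_SUCCESS : ST_INVALID_PARAMETER;
}

/* Helper: try to exit, and on a stale key refetch the map and retry once. */
static status_t exit_boot(struct firmware *fw)
{
    unsigned long key = get_memory_map(fw);
    status_t st = exit_boot_services(fw, key);

    if (st == ST_INVALID_PARAMETER) {
        key = get_memory_map(fw);   /* nothing else is called in between */
        st = exit_boot_services(fw, key);
    }
    return st;
}
```

A stub would call `exit_boot()` in place of a bare exit call and treat any remaining failure as fatal.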
      Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
    • J
      efi/libstub: Allocate headspace in efi_get_memory_map() · dadb57ab
      Jeffrey Hugo committed
      efi_get_memory_map() allocates a buffer to store the memory map that it
      retrieves.  This buffer may need to be reused by the client after
      ExitBootServices() is called, at which point allocations are no longer
      permitted.  To support this use case, provide the allocated buffer size back
      to the client, and allocate some additional headroom to account for any
      reasonable growth in the map that is likely to happen between the call to
      efi_get_memory_map() and the client reusing the buffer.
      Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
    • J
      efi: Make for_each_efi_memory_desc_in_map() cope with running on Xen · d4c4fed0
      Jan Beulich committed
      While commit 55f1ea15 ("efi: Fix for_each_efi_memory_desc_in_map()
      for empty memmaps") made an attempt to deal with empty memory maps, it
      didn't address the case where the map field never gets set, as is
      apparently the case when running under Xen.
      
      Reported-by: <lists@ssl-mail.com>
      Tested-by: <lists@ssl-mail.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: <stable@vger.kernel.org> # v4.7+
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      [ Guard the loop with a NULL check instead of pointer underflow ]
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
    • J
      locking/barriers: Don't use sizeof(void) in lockless_dereference() · d7127b5e
      Johannes Berg committed
      My previous commit:
      
        112dc0c8 ("locking/barriers: Suppress sparse warnings in lockless_dereference()")
      
      caused sparse to complain that (in radix-tree.h) we use sizeof(void)
      since that rcu_dereference()s a void *.
      
      Really, all we need is to have the expression *p in here somewhere
      to make sure p is a pointer type, and sizeof(*p) was the thing that
      came to my mind first to make sure that's done without really doing
      anything at runtime.
      
      Another thing I had considered was using typeof(*p), but obviously
      we can't just declare a typeof(*p) variable either, since that may
      end up being void. Declaring a variable as typeof(*p)* gets around
      that, and still checks that typeof(*p) is valid, so do that. This
      type construction can't be done for _________p1 because that will
      actually be used and causes sparse address space warnings, so keep
      a separate unused variable for it.
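
The type-check trick can be shown in isolation. This is a sketch assuming a GNU C compiler (it uses `__typeof__` and a statement expression); the macro name `checked_ptr` is hypothetical, and the read-once and barrier parts of lockless_dereference() are deliberately left out.

```c
#include <stddef.h>

/*
 * Evaluates to p, but compiles only when p is a pointer.
 * Declaring "__typeof__(*(p)) *" forces *(p) to be a valid expression
 * while never declaring an object of type void, so void * is accepted.
 * Passing a non-pointer, such as an int, is a compile-time error.
 */
#define checked_ptr(p)                                              \
({                                                                  \
    __typeof__(p) val_ = (p);          /* the value that is used */ \
    __typeof__(*(p)) *typecheck_ = 0;  /* exists only for the check */ \
    (void)typecheck_;                                               \
    val_;                                                           \
})
```

Keeping the check in a separate unused variable leaves the type of the returned value exactly `__typeof__(p)`.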
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kbuild-all@01.org
      Fixes: 112dc0c8 ("locking/barriers: Suppress sparse warnings in lockless_dereference()")
      Link: http://lkml.kernel.org/r/1472192160-4049-1-git-send-email-johannes@sipsolutions.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • L
      af_unix: split 'u->readlock' into two: 'iolock' and 'bindlock' · 6e1ce3c3
      Linus Torvalds committed
      Right now we use the 'readlock' both for protecting some of the af_unix
      IO path and for making the bind be single-threaded.
      
      The two are independent, but using the same lock makes for a nasty
      deadlock due to ordering with regards to filesystem locking.  The bind
      locking would want to nest outside the VFS pathname locking, but the IO
      locking wants to nest inside some of those same locks.
      
      We tried to fix this earlier with commit c845acb3 ("af_unix: Fix
      splice-bind deadlock") which moved the readlock inside the vfs locks,
      but that caused problems with overlayfs that will then call back into
      filesystem routines that take the lock in the wrong order anyway.
      
      Splitting the locks means that we can go back to having the bind lock be
      the outermost lock, and we don't have any deadlocks with lock ordering.
      Acked-by: Rainer Weikusat <rweikusat@cyberadapt.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      bonding: Fix bonding crash · 24b27fc4
      Mahesh Bandewar committed
      The following steps will crash the kernel:
      
        (a) Create bonding master
            > modprobe bonding miimon=50
        (b) Create macvlan bridge on eth2
            > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
      	   type macvlan
        (c) Now try adding eth2 into the bond
            > echo +eth2 > /sys/class/net/bond0/bonding/slaves
            <crash>
      
      Bonding does lots of things before checking if the device enslaved is
      busy or not.
      
      In this case when the notifier call-chain sends notifications, the
      bond_netdev_event() assumes that the rx_handler /rx_handler_data is
      registered while the bond_enslave() hasn't progressed far enough to
      register rx_handler for the new slave.
      
      This patch adds a rx_handler check that can be performed right at the
      beginning of the enslave code to avoid getting into this situation.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 03 Sep, 2016 1 commit
    • L
      ACPI / drivers: fix typo in ACPI_DECLARE_PROBE_ENTRY macro · 3feab13c
      Lorenzo Pieralisi committed
      When the ACPI_DECLARE_PROBE_ENTRY macro was added in
      commit e647b532 ("ACPI: Add early device probing infrastructure"),
      a stub macro adding an unused entry was added for the !CONFIG_ACPI
      Kconfig option case to make sure kernel code making use of the
      macro did not need to be guarded within CONFIG_ACPI in order to
      be compiled.
      
      The stub macro was never used since all kernel code that defines
      ACPI_DECLARE_PROBE_ENTRY entries is currently guarded within
      CONFIG_ACPI; it contains a typo that should be nonetheless fixed.
      
      Fix the typo in the stub (ie !CONFIG_ACPI) ACPI_DECLARE_PROBE_ENTRY()
      macro so that it can actually be used if needed.
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Fixes: e647b532 (ACPI: Add early device probing infrastructure)
      Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  17. 02 Sep, 2016 3 commits
  18. 01 Sep, 2016 2 commits
    • M
      ovl: don't cache acl on overlay layer · 2a3a2a3f
      Miklos Szeredi committed
      Some operations (setxattr/chmod) can make the cached acl stale.  We either
      need to clear overlay's acl cache for the affected inode or prevent acl
      caching on the overlay altogether.  Preventing caching has the following
      advantages:
      
       - no double caching, less memory used
      
       - overlay cache doesn't go stale when fs clears its own cache
      
      Possible disadvantage is performance loss.  If that becomes a problem
      get_acl() can be optimized for overlayfs.
      
      This patch disables caching by presetting i_*acl to a value that
      
        - has bit 0 set, so is_uncached_acl() will return true
      
        - is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it
      
      The constant -3 was chosen for this purpose.
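
The two properties of the sentinel can be checked with a small user-space sketch. It assumes ACL_NOT_CACHED is the pointer value -1 and that is_uncached_acl() tests bit 0; the name `ACL_DONT_CACHE` for the -3 constant and the helper `may_fill_cache()` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

struct posix_acl;   /* opaque here; only pointer values matter */

#define ACL_NOT_CACHED ((struct posix_acl *)(intptr_t)-1)   /* assumed value */
#define ACL_DONT_CACHE ((struct posix_acl *)(intptr_t)-3)   /* assumed name */

/* Sentinel values have bit 0 set; pointers to real ACL objects do not. */
static bool is_uncached_acl(const struct posix_acl *acl)
{
    return ((intptr_t)acl & 1) != 0;
}

/*
 * Hypothetical model of the cache fill rule: a looked-up ACL is stored
 * only if the slot still holds ACL_NOT_CACHED.
 */
static bool may_fill_cache(const struct posix_acl *slot)
{
    return slot == ACL_NOT_CACHED;
}
```

A slot preset to -3 therefore always reads as uncached and is never overwritten by a later lookup.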
      
      Fixes: 39a25b2b ("ovl: define ->get_acl() for overlay inodes")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • M
      mm: introduce get_task_exe_file · cd81a917
      Mateusz Guzik committed
      For more convenient access if one has a pointer to the task.
      
      As a minor nit, take advantage of the fact that only task lock + rcu are
      needed to safely grab ->exe_file. This saves the mm refcount dance.
      
      Use the helper in proc_exe_link.
      Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Richard Guy Briggs <rgb@redhat.com>
      Cc: <stable@vger.kernel.org> # 4.3.x
      Signed-off-by: Paul Moore <paul@paul-moore.com>