1. 30 9月, 2016 14 次提交
    • V
      sched/core: Fix incorrect utilization accounting when switching to fair class · a399d233
      Vincent Guittot 提交于
      When a task switches to fair scheduling class, the period between now
      and the last update of its utilization is accounted as running time
      whatever happened during this period. This incorrect accounting applies
      to the task and also to the task group branch.
      
      When changing the property of a running task like its list of allowed
      CPUs or its scheduling class, we follow the sequence:
      
       - dequeue task
       - put task
       - change the property
       - set task as current task
       - enqueue task
      
      The end of the sequence doesn't follow the normal sequence (as per
      __schedule()) which is:
      
       - enqueue a task
       - then set the task as current task.
      
      This incorrectordering is the root cause of incorrect utilization accounting.
      Update the sequence to follow the right one:
      
       - dequeue task
       - put task
       - change the property
       - enqueue task
       - set task as current task
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1473666472-13749-8-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a399d233
    • P
      sched/core: Optimize SCHED_SMT · 1b568f0a
      Peter Zijlstra 提交于
      Avoid pointless SCHED_SMT code when running on !SMT hardware.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1b568f0a
    • P
      sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Peter Zijlstra 提交于
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough and sometimes just
      does plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration; with
      the intent of finding an idle core (on SMT hardware). The problems
      which this patch tries to address are:
      
       - its pointless to look for idle cores if the machine is real busy;
         at which point you're just wasting cycles.
      
       - it's behaviour is inconsistent between SMT and !SMT hardware in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guestimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages. Esp. hackbench was sensitive to this.
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in cpu id order. This avoids multiple
      CPUs in the LLC scanning for idle to gang up and find the same CPU
      quite as much. The down side is that tasks can end up hopping across
      the LLC for no apparent reason.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      10e2f1ac
    • P
      sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Peter Zijlstra 提交于
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0e369d75
    • P
      sched/core: Introduce 'struct sched_domain_shared' · 24fc7edb
      Peter Zijlstra 提交于
      Since struct sched_domain is strictly per cpu; introduce a structure
      that is shared between all 'identical' sched_domains.
      
      Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
      for shared cache state; if another use comes up later we can easily
      relax this.
      
      While the sched_group's are normally shared between CPUs, these are
      not natural to use when we need some shared state on a domain level --
      since that would require the domain to have a parent, which is not a
      given.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      24fc7edb
    • P
      sched/core: Restructure destroy_sched_domain() · 16f3ef46
      Peter Zijlstra 提交于
      There is no point in doing a call_rcu() for each domain, only do a
      callback for the root sched domain and clean up the entire set in one
      go.
      
      Also make the entire call chain be called destroy_sched_domain*() to
      remove confusion with the free_sched_domains() call, which does an
      entirely different thing.
      
      Both cpu_attach_domain() callers of destroy_sched_domain() can live
      without the call_rcu() because at those points the sched_domain hasn't
      been published yet.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      16f3ef46
    • P
      sched/core: Remove unused @cpu argument from destroy_sched_domain*() · f39180ef
      Peter Zijlstra 提交于
      Small cleanup; nothing uses the @cpu argument so make it go away.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      f39180ef
    • O
      sched/wait: Introduce init_wait_entry() · 0176beaf
      Oleg Nesterov 提交于
      The partial initialization of wait_queue_t in prepare_to_wait_event() looks
      ugly. This was done to shrink .text, but we can simply add the new helper
      which does the full initialization and shrink the compiled code a bit more.
      
      And. This way prepare_to_wait_event() can have more users. In particular we
      are ready to remove the signal_pending_state() checks from wait_bit_action_f
      helpers and change __wait_on_bit_lock() to use prepare_to_wait_event().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140055.GA6167@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0176beaf
    • O
      sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock() · eaf9ef52
      Oleg Nesterov 提交于
      __wait_on_bit_lock() doesn't need abort_exclusive_wait() too. Right
      now it can't use prepare_to_wait_event() (see the next change), but
      it can do the additional finish_wait() if action() fails.
      
      abort_exclusive_wait() no longer has callers, remove it.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140053.GA6164@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      eaf9ef52
    • O
      sched/wait: Avoid abort_exclusive_wait() in ___wait_event() · b1ea06a9
      Oleg Nesterov 提交于
      ___wait_event() doesn't really need abort_exclusive_wait(), we can simply
      change prepare_to_wait_event() to remove the waiter from q->task_list if
      it was interrupted.
      
      This simplifies the code/logic, and this way prepare_to_wait_event() can
      have more users, see the next change.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160908164815.GA18801@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      --
       include/linux/wait.h |    7 +------
       kernel/sched/wait.c  |   35 +++++++++++++++++++++++++----------
       2 files changed, 26 insertions(+), 16 deletions(-)
      b1ea06a9
    • O
      sched/wait: Fix abort_exclusive_wait(), it should pass TASK_NORMAL to wake_up() · 38a3e1fc
      Oleg Nesterov 提交于
      Otherwise this logic only works if mode is "compatible" with another
      exclusive waiter.
      
      If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
      abort_exclusive_wait() won't wait an uninterruptible waiter.
      
      The main user is __wait_on_bit_lock() and currently it is fine but only
      because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
      lock_page_interruptible() yet.
      
      Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
      Yes, this means that (say) wake_up_interruptible() can wake up the non-
      interruptible waiter(s), but I think this is fine. And in fact I think
      that abort_exclusive_wait() must die, see the next change.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      38a3e1fc
    • D
      sched/fair: Fix fixed point arithmetic width for shares and effective load · ab522e33
      Dietmar Eggemann 提交于
      Since commit:
      
        2159197d ("sched/core: Enable increased load resolution on 64-bit kernels")
      
      we now have two different fixed point units for load:
      
      - 'shares' in calc_cfs_shares() has 20 bit fixed point unit on 64-bit
        kernels. Therefore use scale_load() on MIN_SHARES.
      
      - 'wl' in effective_load() has 10 bit fixed point unit. Therefore use
        scale_load_down() on tg->shares which has 20 bit fixed point unit on
        64-bit kernels.
      Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1471874441-24701-1-git-send-email-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ab522e33
    • T
      sched/core, x86/topology: Fix NUMA in package topology bug · 8f37961c
      Tim Chen 提交于
      Current code can call set_cpu_sibling_map() and invoke sched_set_topology()
      more than once (e.g. on CPU hot plug).  When this happens after
      sched_init_smp() has been called, we lose the NUMA topology extension to
      sched_domain_topology in sched_init_numa().  This results in incorrect
      topology when the sched domain is rebuilt.
      
      This patch fixes the bug and issues warning if we call sched_set_topology()
      after sched_init_smp().
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@suse.de
      Cc: jolsa@redhat.com
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1474485552-141429-2-git-send-email-srinivas.pandruvada@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8f37961c
    • I
      Merge branch 'linus' into sched/core, to pick up fixes · 536e0e81
      Ingo Molnar 提交于
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      536e0e81
  2. 29 9月, 2016 7 次提交
  3. 28 9月, 2016 1 次提交
  4. 26 9月, 2016 9 次提交
    • L
      Linux 4.8-rc8 · 08895a8b
      Linus Torvalds 提交于
      08895a8b
    • L
      Merge tag 'trace-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 4c04b4b5
      Linus Torvalds 提交于
      Pull tracefs fixes from Steven Rostedt:
       "Al Viro has been looking at the tracefs code, and has pointed out some
        issues.  This contains one fix by me and one by Al.  I'm sure that
        he'll come up with more but for now I tested these patches and they
        don't appear to have any negative impact on tracing"
      
      * tag 'trace-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        fix memory leaks in tracing_buffers_splice_read()
        tracing: Move mutex to protect against resetting of seq data
      4c04b4b5
    • D
      fault_in_multipages_readable() throws set-but-unused error · 90b75db6
      Dave Chinner 提交于
      When building XFS with -Werror, it now fails with:
      
        include/linux/pagemap.h: In function 'fault_in_multipages_readable':
        include/linux/pagemap.h:602:16: error: variable 'c' set but not used [-Werror=unused-but-set-variable]
          volatile char c;
                        ^
      
      This is a regression caused by commit e23d4159 ("fix
      fault_in_multipages_...() on architectures with no-op access_ok()").
      Fix it by re-adding the "(void)c" trick taht was previously used to make
      the compiler think the variable is used.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90b75db6
    • L
      mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing · 38e08854
      Lorenzo Stoakes 提交于
      The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
      defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
      PMDs respectively as requiring balancing upon a subsequent page fault.
      User-defined PROT_NONE memory regions which also have this flag set will
      not normally invoke the NUMA balancing code as do_page_fault() will send
      a segfault to the process before handle_mm_fault() is even called.
      
      However if access_remote_vm() is invoked to access a PROT_NONE region of
      memory, handle_mm_fault() is called via faultin_page() and
      __get_user_pages() without any access checks being performed, meaning
      the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
      region.
      
      A simple means of triggering this problem is to access PROT_NONE mmap'd
      memory using /proc/self/mem which reliably results in the NUMA handling
      functions being invoked when CONFIG_NUMA_BALANCING is set.
      
      This issue was reported in bugzilla (issue 99101) which includes some
      simple repro code.
      
      There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
      added at commit c0e7cad9 to avoid accidentally provoking strange
      behaviour by attempting to apply NUMA balancing to pages that are in
      fact PROT_NONE.  The BUG_ON()'s are consistently triggered by the repro.
      
      This patch moves the PROT_NONE check into mm/memory.c rather than
      invoking BUG_ON() as faulting in these pages via faultin_page() is a
      valid reason for reaching the NUMA check with the PROT_NONE page table
      flag set and is therefore not always a bug.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101Reported-by: NTrevor Saunders <tbsaunde@tbsaunde.org>
      Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38e08854
    • L
      Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 831e45d8
      Linus Torvalds 提交于
      Pull MIPS fixes from Ralf Baechle:
       "A round of 4.8 fixes:
      
        MIPS generic code:
         - Add a missing ".set pop" in an early commit
         - Fix memory regions reaching top of physical
         - MAAR: Fix address alignment
         - vDSO: Fix Malta EVA mapping to vDSO page structs
         - uprobes: fix incorrect uprobe brk handling
         - uprobes: select HAVE_REGS_AND_STACK_ACCESS_API
         - Avoid a BUG warning during PR_SET_FP_MODE prctl
         - SMP: Fix possibility of deadlock when bringing CPUs online
         - R6: Remove compact branch policy Kconfig entries
         - Fix size calc when avoiding IPIs for small icache flushes
         - Fix pre-r6 emulation FPU initialisation
         - Fix delay slot emulation count in debugfs
      
        ATH79:
         - Fix test for error return of clk_register_fixed_factor.
      
        Octeon:
         - Fix kernel header to work for VDSO build.
         - Fix initialization of platform device probing.
      
        paravirt:
         - Fix undefined reference to smp_bootstrap"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: Fix delay slot emulation count in debugfs
        MIPS: SMP: Fix possibility of deadlock when bringing CPUs online
        MIPS: Fix pre-r6 emulation FPU initialisation
        MIPS: vDSO: Fix Malta EVA mapping to vDSO page structs
        MIPS: Select HAVE_REGS_AND_STACK_ACCESS_API
        MIPS: Octeon: Fix platform bus probing
        MIPS: Octeon: mangle-port: fix build failure with VDSO code
        MIPS: Avoid a BUG warning during prctl(PR_SET_FP_MODE, ...)
        MIPS: c-r4k: Fix size calc when avoiding IPIs for small icache flushes
        MIPS: Add a missing ".set pop" in an early commit
        MIPS: paravirt: Fix undefined reference to smp_bootstrap
        MIPS: Remove compact branch policy Kconfig entries
        MIPS: MAAR: Fix address alignment
        MIPS: Fix memory regions reaching top of physical
        MIPS: uprobes: fix incorrect uprobe brk handling
        MIPS: ath79: Fix test for error return of clk_register_fixed_factor().
      831e45d8
    • L
      Merge tag 'powerpc-4.8-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 751b9a5d
      Linus Torvalds 提交于
      Pull one more powerpc fix from Michael Ellerman:
       "powernv/pci: Fix m64 checks for SR-IOV and window alignment from
        Russell Currey"
      
      * tag 'powerpc-4.8-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv/pci: Fix m64 checks for SR-IOV and window alignment
      751b9a5d
    • L
      radix tree: fix sibling entry handling in radix_tree_descend() · 8d2c0d36
      Linus Torvalds 提交于
      The fixes to the radix tree test suite show that the multi-order case is
      broken.  The basic reason is that the radix tree code uses tagged
      pointers with the "internal" bit in the low bits, and calculating the
      pointer indices was supposed to mask off those bits.  But gcc will
      notice that we then use the index to re-create the pointer, and will
      avoid doing the arithmetic and use the tagged pointer directly.
      
      This cleans the code up, using the existing is_sibling_entry() helper to
      validate the sibling pointer range (instead of open-coding it), and
      using entry_to_node() to mask off the low tag bit from the pointer.  And
      once you do that, you might as well just use the now cleaned-up pointer
      directly.
      
      [ Side note: the multi-order code isn't actually ever used in the kernel
        right now, and the only reason I didn't just delete all that code is
        that Kirill Shutemov piped up and said:
      
          "Well, my ext4-with-huge-pages patchset[1] uses multi-order entries.
           It also converts shmem-with-huge-pages and hugetlb to them.
      
           I'm okay with converting it to other mechanism, but I need
           something.  (I looked into Konstantin's RFC patchset[2].  It looks
           okay, but I don't feel myself qualified to review it as I don't
           know much about radix-tree internals.)"
      
        [1] http://lkml.kernel.org/r/20160915115523.29737-1-kirill.shutemov@linux.intel.com
        [2] http://lkml.kernel.org/r/147230727479.9957.1087787722571077339.stgit@zurg ]
      Reported-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Cedric Blancher <cedric.blancher@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d2c0d36
    • M
      radix tree test suite: Test radix_tree_replace_slot() for multiorder entries · 62fd5258
      Matthew Wilcox 提交于
      When we replace a multiorder entry, check that all indices reflect the
      new value.
      
      Also, compile the test suite with -O2, which shows other problems with
      the code due to some dodgy pointer operations in the radix tree code.
      Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62fd5258
    • A
      fix memory leaks in tracing_buffers_splice_read() · 1ae2293d
      Al Viro 提交于
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1ae2293d
  5. 25 9月, 2016 9 次提交
    • S
      tracing: Move mutex to protect against resetting of seq data · 1245800c
      Steven Rostedt (Red Hat) 提交于
      The iter->seq can be reset outside the protection of the mutex. So can
      reading of user data. Move the mutex up to the beginning of the function.
      
      Fixes: d7350c3f ("tracing/core: make the read callbacks reentrants")
      Cc: stable@vger.kernel.org # 2.6.30+
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1245800c
    • P
      MIPS: Fix delay slot emulation count in debugfs · 116e7111
      Paul Burton 提交于
      Commit 432c6bac ("MIPS: Use per-mm page to execute branch delay slot
      instructions") accidentally removed use of the MIPS_FPU_EMU_INC_STATS
      macro from do_dsemulret, leading to the ds_emul file in debugfs always
      returning zero even though we perform delay slot emulations.
      
      Fix this by re-adding the use of the MIPS_FPU_EMU_INC_STATS macro.
      Signed-off-by: NPaul Burton <paul.burton@imgtec.com>
      Fixes: 432c6bac ("MIPS: Use per-mm page to execute branch delay slot instructions")
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/14301/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      116e7111
    • M
      MIPS: SMP: Fix possibility of deadlock when bringing CPUs online · 8f46cca1
      Matt Redfearn 提交于
      This patch fixes the possibility of a deadlock when bringing up
      secondary CPUs.
      The deadlock occurs because the set_cpu_online() is called before
      synchronise_count_slave(). This can cause a deadlock if the boot CPU,
      having scheduled another thread, attempts to send an IPI to the
      secondary CPU, which it sees has been marked online. The secondary is
      blocked in synchronise_count_slave() waiting for the boot CPU to enter
      synchronise_count_master(), but the boot cpu is blocked in
      smp_call_function_many() waiting for the secondary to respond to it's
      IPI request.
      
      Fix this by marking the CPU online in cpu_callin_map and synchronising
      counters before declaring the CPU online and calculating the maps for
      IPIs.
      Signed-off-by: NMatt Redfearn <matt.redfearn@imgtec.com>
      Reported-by: NJustin Chen <justinpopo6@gmail.com>
      Tested-by: NJustin Chen <justinpopo6@gmail.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Cc: stable@vger.kernel.org # v4.1+
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/14302/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      8f46cca1
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9c0e28a7
      Linus Torvalds 提交于
      Pull perf fixes from Thomas Gleixner:
       "Three fixlets for perf:
      
         - add a missing NULL pointer check in the intel BTS driver
      
         - make BTS an exclusive PMU because BTS can only handle one event at
           a time
      
         - ensure that exclusive events are limited to one PMU so that several
           exclusive events can be scheduled on different PMU instances"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/core: Limit matching exclusive events to one PMU
        perf/x86/intel/bts: Make it an exclusive PMU
        perf/x86/intel/bts: Make sure debug store is valid
      9c0e28a7
    • L
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2507c856
      Linus Torvalds 提交于
      Pull locking fixes from Thomas Gleixner:
       "Two smallish fixes:
      
         - use the proper asm constraint in the Super-H atomic_fetch_ops
      
         - a trivial typo fix in the Kconfig help text"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/hung_task: Fix typo in CONFIG_DETECT_HUNG_TASK help text
        locking/atomic, arch/sh: Fix ATOMIC_FETCH_OP()
      2507c856
    • L
      Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 709b8f67
      Linus Torvalds 提交于
      Pull EFI fixes from Thomas Gleixner:
       "Two fixes for EFI/PAT:
      
         - a 32bit overflow bug in the PAT code which was unearthed by the
           large EFI mappings
      
         - prevent a boot hang on large systems when EFI mixed mode is enabled
           but not used"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/efi: Only map RAM into EFI page tables if in mixed-mode
        x86/mm/pat: Prevent hang during boot when mapping pages
      709b8f67
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4b8b0ff6
      Linus Torvalds 提交于
      Pull irq fixes from Thomas Gleixner:
       "Three fixes for irq core and irq chip drivers:
      
         - Do not set the irq type if type is NONE.  Fixes a boot regression
           on various SoCs
      
         - Use the proper cpu for setting up the GIC target list.  Discovered
           by the cpumask debugging code.
      
         - A rather large fix for the MIPS-GIC so per cpu local interrupts
           work again.  This was discovered late because the code falls back
           to slower timers which use normal device interrupts"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/mips-gic: Fix local interrupts
        irqchip/gicv3: Silence noisy DEBUG_PER_CPU_MAPS warning
        genirq: Skip chained interrupt trigger setup if type is IRQ_TYPE_NONE
      4b8b0ff6
    • L
      Merge branch 'hughd-fixes' (patches from Hugh Dickins) · 0f265741
      Linus Torvalds 提交于
      Merge VM fixes from High Dickins:
       "I get the impression that Andrew is away or busy at the moment, so I'm
        going to send you three independent uncontroversial little mm fixes
        directly - though none is strictly a 4.8 regression fix.
      
         - shmem: fix tmpfs to handle the huge= option properly from Toshi
           Kani is a one-liner to fix a major embarrassment in 4.8's hugepages
           on tmpfs feature: although Hillf pointed it out in June, somehow
           both Kirill and I repeatedly dropped the ball on this one.  You
           might wonder if the feature got tested at all with that bug in:
           yes, it did, but for wider testing coverage, Kirill and I had each
           relied too much on an override which bypasses that condition.
      
         - huge tmpfs: fix Committed_AS leak just a run-of-the-mill accounting
           fix in the same feature.
      
         - mm: delete unnecessary and unsafe init_tlb_ubc() is an unrelated
           fix to 4.3's TLB flush batching in reclaim: the bug would be rare,
           and none of us will be shamed if this one misses 4.8; but it got
           such a quick ack from Mel today that I'm inclined to offer it along
           with the first two"
      
      * emailed patches from Hugh Dickins <hughd@google.com>:
        mm: delete unnecessary and unsafe init_tlb_ubc()
        huge tmpfs: fix Committed_AS leak
        shmem: fix tmpfs to handle the huge= option properly
      0f265741
    • H
      mm: delete unnecessary and unsafe init_tlb_ubc() · b385d21f
      Hugh Dickins 提交于
      init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
      initialized with zeroes in the init_task, and copied from parent to
      child while it is quiescent in arch_dup_task_struct(); so I went to
      delete it.
      
      But inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
      check that it was always empty at that point, and found them firing:
      because memcg reclaim can recurse into global reclaim (when allocating
      biosets for swapout in my case), and arrive back at the init_tlb_ubc()
      in shrink_node_memcg().
      
      Resetting tlb_ubc.flush_required at that point is wrong: if the upper
      level needs a deferred TLB flush, but the lower level turns out not to,
      we miss a TLB flush.  But fortunately, that's the only part of the
      protocol that does not nest: with the initialization removed, cpumask
      collects bits from upper and lower levels, and flushes TLB when needed.
      
      Fixes: 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: stable@vger.kernel.org # 4.3+
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b385d21f