1. 08 Nov 2017, 1 commit
    • locking/pvqspinlock: Implement hybrid PV queued/unfair locks · 11752adb
      Waiman Long committed
      Currently, all the lock waiters entering the slowpath will make one
      lock-stealing attempt to acquire the lock. That helps performance,
      especially in VMs with over-committed vCPUs. However, the current
      pvqspinlocks still don't perform as well as unfair locks in many cases.
      On the other hand, unfair locks do have the problem of lock starvation
      that pvqspinlocks don't have.
      
      This patch combines the best attributes of an unfair lock and a
      pvqspinlock into a hybrid lock with two modes - queued mode and unfair
      mode. A lock waiter goes into unfair mode when there are waiters in
      the wait queue but the pending bit isn't set. Otherwise, it will go
      into queued mode and wait in the queue for its turn. A sketch of that
      mode decision is shown below.
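
      A minimal sketch of the mode decision, modeled on the upstream
      pv_hybrid_queued_unfair_trylock() (illustrative - stat accounting and
      the __qspinlock casting used in this era's code are omitted):

        static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
        {
        	/*
        	 * Stay in unfair mode as long as queued-mode waiters are
        	 * present in the wait queue but the pending bit isn't set.
        	 */
        	for (;;) {
        		int val = atomic_read(&lock->val);

        		/* Lock free: try to steal it with a byte-sized cmpxchg. */
        		if (!(val & _Q_LOCKED_PENDING_MASK) &&
        		    (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0))
        			return true;	/* acquired in unfair mode */

        		/* No queued waiters, or pending bit set: queue up. */
        		if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
        			return false;	/* fall back to queued mode */

        		cpu_relax();
        	}
        }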
      
      On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
      (make -j<n>) was done in a VM with unpinned vCPUs 3 times with the
      best time selected, where <n> is the number of vCPUs available. The
      build times of the original pvqspinlock, the hybrid pvqspinlock and
      the unfair lock with various numbers of vCPUs are as follows:
      
        vCPUs    pvqlock     hybrid pvqlock    unfair lock
        -----    -------     --------------    -----------
          30      342.1s         329.1s          329.1s
          36      314.1s         305.3s          307.3s
          45      345.0s         302.1s          306.6s
          54      365.4s         308.6s          307.8s
          72      358.9s         293.6s          303.9s
         108      343.0s         285.9s          304.2s
      
      The hybrid pvqspinlock performs better than or comparably to the
      unfair lock.
      
      With QUEUED_LOCK_STAT turned on, the table below shows the number
      of lock acquisitions in unfair mode and queued mode after a kernel
      build with various numbers of vCPUs.
      
        vCPUs    queued mode  unfair mode
        -----    -----------  -----------
          30      9,130,518      294,954
          36     10,856,614      386,809
          45      8,467,264   11,475,373
          54      6,409,987   19,670,855
          72      4,782,063   25,712,180
      
      It can be seen that as the VM becomes more and more over-committed,
      the proportion of locks acquired in unfair mode increases. This is
      all done automatically to get the best possible overall performance.
      
      Using a kernel locking microbenchmark with the number of locking
      threads equal to the number of vCPUs available on the same machine,
      the minimum, average and maximum (min/avg/max) numbers of locking
      operations done per thread in a 5-second testing interval are shown
      below:
      
        vCPUs         hybrid pvqlock             unfair lock
        -----         --------------             -----------
          36     822,135/881,063/950,363    75,570/313,496/  690,465
          54     542,435/581,664/625,937    35,460/204,280/  457,172
          72     397,500/428,177/499,299    17,933/150,679/  708,001
         108     257,898/288,150/340,871     3,085/181,176/1,257,109
      
      It can be seen that the hybrid pvqspinlocks are both fairer and more
      performant than the unfair locks in this test.
      
      The table below shows the kernel build times on a smaller 2-socket
      16-core 32-thread E5-2620 v4 system.
      
        vCPUs    pvqlock     hybrid pvqlock    unfair lock
        -----    -------     --------------    -----------
          16      436.8s         433.4s          435.6s
          36      366.2s         364.8s          364.5s
          48      423.6s         376.3s          370.2s
          64      433.1s         376.6s          376.8s
      
      Again, the performance of the hybrid pvqspinlock was comparable to
      that of the unfair lock.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Eduardo Valentin <eduval@amazon.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1510089486-3466-1-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      11752adb
  2. 07 Nov 2017, 1 commit
  3. 02 Nov 2017, 5 commits
    • futex: futex_wake_op, do not fail on invalid op · e78c38f6
      Jiri Slaby committed
      In commit 30d6e0a4 ("futex: Remove duplicated code and fix undefined
      behaviour"), I made FUTEX_WAKE_OP fail on an invalid op, namely when
      the op is to be treated as a shift and the shift is out of range
      (< 0 or > 31).
      
      But strace's test suite does this madness:
      
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee);
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xbadfaced);
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xffffffff);
      
      Picking the first one, 0xa0caffee decodes as:
      
        0x80000000 & 0xa0caffee: oparg is shift
        0x70000000 & 0xa0caffee: op is FUTEX_OP_OR
        0x0f000000 & 0xa0caffee: cmp is FUTEX_OP_CMP_EQ
        0x00fff000 & 0xa0caffee: oparg is sign-extended 0xcaf = -849
        0x00000fff & 0xa0caffee: cmparg is sign-extended 0xfee = -18
      
      That means the op tries to do this:
      
        (futex |= (1 << (-849))) == -18
      
      which is completely bogus. The new check of op in the code is:
      
              if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
                      if (oparg < 0 || oparg > 31)
                              return -EINVAL;
                      oparg = 1 << oparg;
              }
      
      which obviously results in the "Invalid argument" errno:
      
        FAIL: futex
        ===========
      
        futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee) = -1: Invalid argument
        futex.test: failed test: ../futex failed with code 1
      
      So let us soften the failure to print only a (ratelimited) message,
      crop the value and continue as if it were right. Once userspace
      catches up, we can switch this back to returning -EINVAL.
      
      [v2] Do not return 0 immediately; proceed with the cropped value.
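
      With that, the softened check looks roughly like this (a sketch
      reconstructed from the description above):

        if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
        	if (oparg < 0 || oparg > 31) {
        		char comm[sizeof(current->comm)];
        		/*
        		 * kill this print and return -EINVAL when userspace
        		 * is sane again
        		 */
        		pr_info_ratelimited("futex_wake_op: %s tries to shift op by %d; fix this program\n",
        				    get_task_comm(comm, current), oparg);
        		oparg &= 31;	/* crop and continue */
        	}
        	oparg = 1 << oparg;
        }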
      
      Fixes: 30d6e0a4 ("futex: Remove duplicated code and fix undefined behaviour")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <dvhart@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e78c38f6
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman committed
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boilerplate text.
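
      For example, the single added line takes one of these forms (the
      comment style depends on the file type, as handled by the script
      described further below):

        // SPDX-License-Identifier: GPL-2.0      <- .c source files
        /* SPDX-License-Identifier: GPL-2.0 */   <- header files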
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset
      of the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to apply to
      a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing
      SPDX tag:value files created by Philippe Ombredanne.  Philippe
      prepared the base worksheet, and did an initial spot review of a few
      1000 files.
      
      The 4.13 kernel was the starting point of the analysis, with 60,537
      files assessed.  Kate Stewart did a file-by-file comparison of the
      scanner results in the spreadsheet to determine which SPDX license
      identifier(s) should be applied to each file. She confirmed any
      determination that was not immediately clear with lawyers working
      with the Linux Foundation.
      
      The criteria used to select files for SPDX license identifier tagging
      were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained
         >5 lines of source.
       - The file already had some variant of a license header in it (even
         if <5 lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when neither scanner could find any license traces, the file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non-*/uapi/* files, that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If the file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note", otherwise it was "GPL-2.0".  The results were:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL-family license was found in the file or it had no licensing
         in it (per the prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses), a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.
      The Windriver scanner is based in part on an older version of
      FOSSology, so the two are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with the SPDX license identifiers in
      the files he inspected. For the non-uapi files, Thomas did random spot
      checks in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to
      have copy/paste license identifier errors, and have been fixed to
      reflect the correct identifier.
      
      Additionally, Philippe spent 10 hours doing a detailed manual
      inspection and review of the 12,461 files patched in the initial
      version earlier this week, with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the .csv files and add the proper SPDX tag to each file, in the
      format that the file expected.  This script was further refined by
      Greg based on the output to detect more types of files automatically
      and to distinguish between header files and .c source files (which
      need different comment types).  Finally, Greg ran the script using the
      .csv files to generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
    • signal: Fix name of SIGEMT in #if defined() check · c3aff086
      Andrew Clayton committed
      Commit cc731525 ("signal: Remove kernel interal si_code magic")
      added a check for SIGMET and NSIGEMT being defined. That SIGMET
      should in fact be SIGEMT, with SIGEMT being defined in
      arch/{alpha,mips,sparc}/include/uapi/asm/signal.h.
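
      The fix is the one-character rename in the preprocessor guard (the
      guarded entry is abbreviated here):

        -#if defined(SIGMET) && defined(NSIGEMT)
        +#if defined(SIGEMT) && defined(NSIGEMT)
         	/* ... SIGEMT handling ... */
         #endif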
      
      This was pointed out by Ben Hutchings in an lwn.net comment:
      https://lwn.net/Comments/734608/
      
      Fixes: cc731525 ("signal: Remove kernel interal si_code magic")
      Signed-off-by: Andrew Clayton <andrew@digital-domain.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      c3aff086
    • watchdog/hardlockup/perf: Use atomics to track in-use cpu counter · 42f930da
      Don Zickus committed
      Guenter reported:
        There is still a problem. When running
          echo 6 > /proc/sys/kernel/watchdog_thresh
          echo 5 > /proc/sys/kernel/watchdog_thresh
        repeatedly, the message

          NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

        stops after a while (after ~10-30 iterations, with fluctuations).
        Maybe watchdog_cpus needs to be atomic?
      
      That's correct, as this is again affected by the asynchronous nature
      of the smpboot thread unpark mechanism.
      
      CPU 0				CPU1			CPU2
      write(watchdog_thresh, 6)	
        stop()
          park()
        update()
        start()
          unpark()
      				thread->unpark()
      				  cnt++;
      write(watchdog_thresh, 5)				thread->unpark()
        stop()
          park()			thread->park()
      				   cnt--;		  cnt++;
        update()
        start()
          unpark()
      
      That's not a functional problem, it just affects the informational message.
      
      Convert watchdog_cpus to atomic_t to prevent the problem.
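
      A sketch of the change (function and variable names follow the
      upstream watchdog_hld.c; treat the details as illustrative):

        static atomic_t watchdog_cpus = ATOMIC_INIT(0);

        void hardlockup_detector_perf_enable(void)
        {
        	if (hardlockup_detector_event_create())
        		return;

        	/* atomic_fetch_inc() makes the first-enable test race-free */
        	if (!atomic_fetch_inc(&watchdog_cpus))
        		pr_info("Enabled. Permanently consumes one hw-PMU counter.\n");

        	perf_event_enable(this_cpu_read(watchdog_ev));
        }

        void hardlockup_detector_perf_disable(void)
        {
        	struct perf_event *event = this_cpu_read(watchdog_ev);

        	if (event) {
        		perf_event_disable(event);
        		atomic_dec(&watchdog_cpus);
        	}
        }
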
      Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20171101181126.j727fqjmdthjz4xk@redhat.com
      
      42f930da
    • watchdog/harclockup/perf: Revert a33d4484 ("watchdog/hardlockup/perf:... · 9c388a5e
      Thomas Gleixner committed
      watchdog/harclockup/perf: Revert a33d4484 ("watchdog/hardlockup/perf: Simplify deferred event destroy")
      
      Guenter reported a crash in the watchdog/perf code, which is caused by
      cleanup() and enable() running concurrently. The reason for this is:

      The watchdog functions are serialized via the watchdog_mutex and cpu
      hotplug locking, but the enable of the perf based watchdog happens in
      the context of the unpark callback of the smpboot thread. That unpark
      operation is not synchronous inside the locking: it merely wakes the
      thread up and returns, so there is no guarantee of when the thread
      will actually run.
      
      If it starts running _before_ the cleanup has happened, it will create
      an event and overwrite the dead event pointer. The new event is then
      cleaned up because it is marked dead:
      
          lock(watchdog_mutex);
          lockup_detector_reconfigure();
              cpus_read_lock();
      	stop();
      	   park()
      	update();
      	start();
      	   unpark()
      	cpus_read_unlock();		thread runs()
      					  overwrite dead event ptr
      	cleanup();
      	  free new event, which is active inside perf....
          unlock(watchdog_mutex);
      
      The park side is safe as that actually waits for the thread to reach
      parked state.
      
      Commit a33d4484 removed the protection against this kind of scenario
      under the stupid assumption that the hotplug serialization and the
      watchdog_mutex cover everything. 
      
      Bring it back.
      
      Reverts: a33d4484 ("watchdog/hardlockup/perf: Simplify deferred event destroy")
      Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Thomas Feels-stupid Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710312145190.1942@nanos
      
      
      9c388a5e
  4. 01 Nov 2017, 2 commits
  5. 30 Oct 2017, 2 commits
    • workqueue: Fix NULL pointer dereference · cef572ad
      Li Bin committed
      When queue_work() is used in irq context (not in task context), there
      is a potential case that triggers a NULL pointer dereference.
      ----------------------------------------------------------------
      worker_thread()
      |-spin_lock_irq()
      |-process_one_work()
      	|-worker->current_pwq = pwq
      	|-spin_unlock_irq()
      	|-worker->current_func(work)
      	|-spin_lock_irq()
       	|-worker->current_pwq = NULL
      |-spin_unlock_irq()
      
      				//interrupt here
      				|-irq_handler
      					|-__queue_work()
      						//assuming that the wq is draining
      						|-is_chained_work(wq)
      							|-current_wq_worker()
      							//Here, 'current' is the interrupted worker!
      								|-current->current_pwq is NULL here!
      |-schedule()
      ----------------------------------------------------------------
      
      Avoid this by checking for task context in current_wq_worker(): when
      not in task context, 'current' must not be used to check the
      condition, as shown in the sketch below.
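
      A sketch of the fixed helper (matching the shape of the upstream
      kernel/workqueue_internal.h; treat as illustrative):

        static inline struct worker *current_wq_worker(void)
        {
        	/*
        	 * In irq context, 'current' is merely the interrupted task;
        	 * its fields say nothing about the work being queued from
        	 * the irq handler.
        	 */
        	if (in_task() && (current->flags & PF_WQ_WORKER))
        		return kthread_data(current);
        	return NULL;
        }
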
      Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
      Signed-off-by: Li Bin <huawei.libin@huawei.com>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 8d03ecfe ("workqueue: reimplement is_chained_work() using current_wq_worker()")
      Cc: stable@vger.kernel.org # v3.9+
      cef572ad
    • perf/cgroup: Fix perf cgroup hierarchy support · be96b316
      Tejun Heo committed
      The following commit:
      
        864c2357 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
      
      made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event
      targets %current's cgroup.
      
      This breaks perf_event's hierarchical support, because events which
      target one of the ancestor cgroups get ignored.

      Fix it by using a cgroup_is_descendant() test instead of an equality
      check.
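
      A sketch of the fix (names follow the upstream
      list_update_cgroup_event(); illustrative):

        cgrp = perf_cgroup_from_task(current, ctx);
        /*
         * An event targeting an ancestor cgroup must also match tasks
         * running in any of that cgroup's descendants.
         */
        if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup))
        	cpuctx->cgrp = cgrp;
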
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: David Carrillo-Cisneros <davidcc@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kernel-team@fb.com
      Cc: stable@vger.kernel.org # v4.9+
      Fixes: 864c2357 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
      Link: http://lkml.kernel.org/r/20171028164237.GA972780@devbig577.frc2.facebook.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      be96b316
  6. 29 Oct 2017, 2 commits
  7. 25 Oct 2017, 8 commits
    • workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes · fd1a5b04
      Byungchul Park committed
      The workqueue code added manual lock acquisition annotations to catch
      deadlocks.

      After lockdep cross-release was introduced, some of those became
      redundant, since wait_for_completion() already does the acquisition
      and tracking.

      Remove the duplicate annotations.
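
      The shape of the change, illustratively (not the exact hunks): the
      manual annotation pairs around the flush wait are dropped, e.g.

        void flush_workqueue(struct workqueue_struct *wq)
        {
        	...
        -	lock_map_acquire(&wq->lockdep_map);
        -	lock_map_release(&wq->lockdep_map);
        	wait_for_completion(&this_flusher.done);
        	...
        }
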
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: darrick.wong@oracle.com
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johan@kernel.org
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-9-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fd1a5b04
    • locking/lockdep: Introduce CONFIG_BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK=y · e121d64e
      Byungchul Park committed
      Add a Kconfig knob that enables the lockdep "crossrelease_fullstack" boot parameter.
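
      Presumably the knob looks along these lines in lib/Kconfig.debug (a
      sketch; the help text is omitted):

        config BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK
        	bool "Enable the boot parameter, crossrelease_fullstack"
        	depends on LOCKDEP_CROSSRELEASE
        	default n
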
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: darrick.wong@oracle.com
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johan@kernel.org
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-7-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e121d64e
    • locking/lockdep: Add a boot parameter allowing unwind in cross-release and disable it by default · d141babe
      Byungchul Park committed
      Johan Hovold reported a heavy performance regression caused by lockdep
      cross-release:
      
       > Boot time (from "Linux version" to login prompt) had in fact doubled
       > since 4.13 where it took 17 seconds (with my current config) compared to
       > the 35 seconds I now see with 4.14-rc4.
       >
       > A quick bisect pointed to lockdep and specifically the following commit:
       >
       >	28a903f6 ("locking/lockdep: Handle non(or multi)-acquisition
       >	               of a crosslock")
       >
       > which I've verified is the commit which doubled the boot time (compared
       > to 28a903f6^) (added by lockdep crossrelease series [1]).
      
      Currently cross-release performs an unwind on every acquisition, which
      is very expensive.

      This patch makes the unwind optional, disables it by default, and
      records only acquire_ip.
      
      Full stack traces are sometimes required for full analysis, in which
      case a boot parameter, crossrelease_fullstack, can be specified.
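
      For example, appended to the kernel command line (the bootloader entry
      and other parameters here are placeholders):

        linux /boot/vmlinuz-4.14.0 root=/dev/sda1 ro crossrelease_fullstack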
      
      On my qemu Ubuntu machine (x86_64, 4 cores, 512M), the regression was
      fixed. We measure boot times with 'perf stat --null --repeat 10 $QEMU',
      where $QEMU launches a kernel with init=/bin/true:
      
      1. No lockdep enabled:
      
       Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):
      
             2.756558155 seconds time elapsed                    ( +-  0.09% )
      
      2. Lockdep enabled:
      
       Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):
      
             2.968710420 seconds time elapsed                    ( +-  0.12% )
      
      3. Lockdep enabled + cross-release enabled:
      
       Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):
      
             3.153839636 seconds time elapsed                    ( +-  0.31% )
      
      4. Lockdep enabled + cross-release enabled + this patch applied:
      
       Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):
      
             2.963669551 seconds time elapsed                    ( +-  0.11% )
      
      I.e. lockdep cross-release performance is now indistinguishable from
      vanilla lockdep.
      Bisected-by: Johan Hovold <johan@kernel.org>
      Analyzed-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: amir73il@gmail.com
      Cc: axboe@kernel.dk
      Cc: darrick.wong@oracle.com
      Cc: david@fromorbit.com
      Cc: hch@infradead.org
      Cc: idryomov@gmail.com
      Cc: johannes.berg@intel.com
      Cc: kernel-team@lge.com
      Cc: linux-block@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-xfs@vger.kernel.org
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1508921765-15396-5-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d141babe
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland committed
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
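
      As an illustration (hypothetical fields, not a hunk from the actual
      patch), the script rewrites

        val = ACCESS_ONCE(p->state);
        ACCESS_ONCE(p->state) = new;

      into

        val = READ_ONCE(p->state);
        WRITE_ONCE(p->state, new);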
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6aa7de05
    • locking/atomics, workqueue: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE() · c95491ed
      Mark Rutland committed
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't currently harmful.
      
      However, for some features it is necessary to instrument reads and
      writes separately, which is not possible with ACCESS_ONCE(). This
      distinction is critical to correct operation.
      
      It's possible to transform the bulk of kernel code using the Coccinelle
      script below. However, this doesn't handle comments, leaving references
      to ACCESS_ONCE() instances which have been removed. As a preparatory
      step, this patch converts the workqueue code and comments to use
      {READ,WRITE}_ONCE() consistently.
      
      ----
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-12-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c95491ed
    • locking/qrwlock: Prevent slowpath writers getting held up by fastpath · d1331661
      Will Deacon committed
      When a prospective writer takes the qrwlock locking slowpath due to the
      lock being held, it attempts to cmpxchg the wmode field from 0 to
      _QW_WAITING so that concurrent lockers also take the slowpath and queue
      on the spinlock accordingly, allowing the lockers to drain.
      
      Unfortunately, this isn't fair, because a fastpath writer that comes in
      after the lock is made available but before the _QW_WAITING flag is set
      can effectively jump the queue. If there is a steady stream of prospective
      writers, then the waiter will be held off indefinitely.
      
      This patch restores fairness by separating _QW_WAITING and _QW_LOCKED
      into two distinct fields: _QW_LOCKED continues to occupy the bottom
      byte of the lockword so that it can be cleared unconditionally when
      unlocking, but _QW_WAITING now occupies what used to be the bottom bit
      of the reader count. This then forces the slowpath for concurrent
      lockers. The resulting layout is sketched below.
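
      (A sketch of the post-change definitions, following the upstream
      asm-generic/qrwlock.h; treat as illustrative.)

        /*
         * Lockword layout:
         *   bits  0-7 : _QW_LOCKED  - a writer holds the lock
         *   bit     8 : _QW_WAITING - a writer is waiting
         *   bits  9-31: reader count
         */
        #define _QW_WAITING	0x100		/* A writer is waiting	   */
        #define _QW_LOCKED	0x0ff		/* A writer holds the lock */
        #define _QW_WMASK	0x1ff		/* Writer mask		   */
        #define _QR_SHIFT	9		/* Reader count shift	   */
        #define _QR_BIAS	(1U << _QR_SHIFT)
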
      Tested-by: Waiman Long <longman@redhat.com>
      Tested-by: Jeremy Linton <jeremy.linton@arm.com>
      Tested-by: Adam Wallis <awallis@codeaurora.org>
      Tested-by: Jan Glauber <jglauber@cavium.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Jeremy.Linton@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1507810851-306-6-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d1331661
    • locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock · b519b56e
      Will Deacon committed
      The qrwlock slowpaths involve spinning when either a prospective reader
      is waiting for a concurrent writer to drain, or a prospective writer is
      waiting for concurrent readers to drain. In both of these situations,
      atomic_cond_read_acquire() can be used to avoid busy-waiting and make use
      of any backoff functionality provided by the architecture.
      
      This patch replaces the open-coded loops and the
      rspin_until_writer_unlock() implementation with
      atomic_cond_read_acquire(). The write-mode transition from zero to
      _QW_WAITING is left alone, since (a) it doesn't need acquire semantics
      and (b) it should be fast.
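
      For example, the reader and writer wait loops reduce to something
      like this (a sketch; VAL is the magic variable that
      atomic_cond_read_acquire() exposes to its condition expression):

        /* reader slowpath: wait until no writer holds the lock */
        atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));

        /* writer slowpath: wait until only our _QW_WAITING bit remains */
        do {
        	atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
        } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
        				_QW_LOCKED) != _QW_WAITING);
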
      Tested-by: Waiman Long <longman@redhat.com>
      Tested-by: Jeremy Linton <jeremy.linton@arm.com>
      Tested-by: Adam Wallis <awallis@codeaurora.org>
      Tested-by: Jan Glauber <jglauber@cavium.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Jeremy.Linton@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1507810851-306-4-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b519b56e
    • locking/qrwlock: Use 'struct qrwlock' instead of 'struct __qrwlock' · e0d02285
      Will Deacon committed
      There's no good reason to keep the internal structure of struct
      qrwlock hidden from qrwlock.h, particularly as it's actually needed
      for unlock and ends up being abstracted independently behind the
      __qrwlock_write_byte() function.

      Stop pretending we can hide this stuff, and move the __qrwlock
      definition into qrwlock.h, removing the __qrwlock_write_byte()
      nastiness and using the same struct definition everywhere instead.
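
      The unified definition then looks roughly like this (a sketch
      following the upstream asm-generic/qrwlock_types.h):

        typedef struct qrwlock {
        	union {
        		atomic_t cnts;
        		struct {
        #ifdef __LITTLE_ENDIAN
        			u8 wlocked;	/* Locked for write? */
        			u8 __lstate[3];
        #else
        			u8 __lstate[3];
        			u8 wlocked;	/* Locked for write? */
        #endif
        		};
        	};
        	arch_spinlock_t	wait_lock;
        } arch_rwlock_t;

      so that write-unlock becomes a plain byte store:

        smp_store_release(&lock->wlocked, 0);
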
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Jeremy.Linton@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1507810851-306-2-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e0d02285
  8. 24 Oct 2017, 1 commit
  9. 22 Oct 2017, 3 commits
  10. 21 Oct 2017, 2 commits
  11. 20 Oct 2017, 6 commits
  12. 19 Oct 2017, 3 commits
    • bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations · bc6d5031
      Daniel Borkmann committed
      PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu
      allocator. Given that we support __GFP_NOWARN now, let's just let
      the allocation request fail naturally instead. The two call sites
      in BPF mistakenly assumed __GFP_NOWARN would work, so no changes
      are needed to their actual __alloc_percpu_gfp() calls, which
      already use the flag.
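
      The removed guards looked roughly like this (illustrative; the exact
      condition differs between the two call sites):

        -	if (percpu && round_up(size, 8) > PCPU_MIN_UNIT_SIZE)
        -		return ERR_PTR(-E2BIG);
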
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc6d5031
    • bpf: fix splat for illegal devmap percpu allocation · 82f8dd28
      Daniel Borkmann committed
      It was reported that syzkaller was able to trigger a splat on
      devmap percpu allocation due to illegal/unsupported allocation
      request size passed to __alloc_percpu():
      
        [   70.094249] illegal size (32776) or align (8) for percpu allocation
        [   70.094256] ------------[ cut here ]------------
        [   70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 pcpu_alloc+0x96/0x630
        [...]
        [   70.094325] Call Trace:
        [   70.094328]  __alloc_percpu_gfp+0x12/0x20
        [   70.094330]  dev_map_alloc+0x134/0x1e0
        [   70.094331]  SyS_bpf+0x9bc/0x1610
        [   70.094333]  ? selinux_task_setrlimit+0x5a/0x60
        [   70.094334]  ? security_task_setrlimit+0x43/0x60
        [   70.094336]  entry_SYSCALL_64_fastpath+0x1a/0xa5
      
      This was due to a max_entries value large enough that the allocation
      size surpassed the PCPU_MIN_UNIT_SIZE upper limit. It's fine to fail
      naturally here, so switch to __alloc_percpu_gfp() and pass
      __GFP_NOWARN instead.
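
      A sketch of the change in dev_map_alloc() (the size helper and field
      names follow the upstream devmap code; illustrative):

        dtab->flush_needed = __alloc_percpu_gfp(dev_map_bitmap_size(attr),
        					__alignof__(unsigned long),
        					GFP_KERNEL | __GFP_NOWARN);
        if (!dtab->flush_needed)
        	goto free_dtab;	/* fail quietly instead of splatting */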
      
      Fixes: 11393cc9 ("xdp: Add batching support to redirect map")
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Reported-by: Shankara Pailoor <sp3485@columbia.edu>
      Reported-by: Richard Weinberger <richard@nod.at>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82f8dd28
    • locking/static_keys: Improve uninitialized key warning · 5cdda511
      Borislav Petkov committed
      Right now it says:
      
        static_key_disable_cpuslocked used before call to jump_label_init
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:161 static_key_disable_cpuslocked+0x68/0x70
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-rc5+ #1
        Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
        task: ffffffff81c0e480 task.stack: ffffffff81c00000
        RIP: 0010:static_key_disable_cpuslocked+0x68/0x70
        RSP: 0000:ffffffff81c03ef0 EFLAGS: 00010096 ORIG_RAX: 0000000000000000
        RAX: 0000000000000041 RBX: ffffffff81c32680 RCX: ffffffff81c5cbf8
        RDX: 0000000000000001 RSI: 0000000000000092 RDI: 0000000000000002
        RBP: ffff88807fffd240 R08: 726f666562206465 R09: 0000000000000136
        R10: 0000000000000000 R11: 696e695f6c656261 R12: ffffffff82158900
        R13: ffffffff8215f760 R14: 0000000000000001 R15: 0000000000000008
        FS:  0000000000000000(0000) GS:ffff883f7f400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffff88807ffff000 CR3: 0000000001c09000 CR4: 00000000000606b0
        Call Trace:
         static_key_disable+0x16/0x20
         start_kernel+0x15a/0x45d
         ? load_ucode_intel_bsp+0x11/0x2d
         secondary_startup_64+0xa5/0xb0
        Code: 48 c7 c7 a0 15 cf 81 e9 47 53 4b 00 48 89 df e8 5f fc ff ff eb e8 48 c7 c6 \
      	c0 97 83 81 48 c7 c7 d0 ff a2 81 31 c0 e8 c5 9d f5 ff <0f> ff eb a7 0f ff eb \
      	b0 e8 eb a2 4b 00 53 48 89 fb e8 42 0e f0
      
      but it doesn't tell me which key it is. So dump the key's name too:
      
        static_key_disable_cpuslocked(): static key 'virt_spin_lock_key' used before call to jump_label_init()
      
      And that makes pinpointing the offending key a lot easier.
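
      A sketch of how the warning gains the key name (the macro shape
      follows the upstream include/linux/jump_label.h; %pS prints the
      key's symbol name):

        #define STATIC_KEY_CHECK_USE(key)					\
        	WARN(!static_key_initialized,					\
        	     "%s(): static key '%pS' used before call to jump_label_init()", \
        	     __func__, (key))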
      
       include/linux/jump_label.h           |   14 +++++++-------
       include/linux/jump_label_ratelimit.h |    6 +++---
       kernel/jump_label.c                  |   14 +++++++-------
       3 files changed, 17 insertions(+), 17 deletions(-)
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171018152428.ffjgak4o25f7ept6@pd.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5cdda511
  13. 18 Oct 2017, 1 commit
    • bpf: disallow arithmetic operations on context pointer · 28e33f9d
      Jakub Kicinski committed
      Commit f1174f77 ("bpf/verifier: rework value tracking")
      removed the crafty selection of which pointer types are
      allowed to be modified.  This is OK for most pointer types,
      since adjust_ptr_min_max_vals() will catch operations on
      immutable pointers.  One exception is PTR_TO_CTX, which is
      now allowed to be offset freely.
      
      The intent of the aforementioned commit was to allow context
      access via modified registers.  The offset passed to the
      ->is_valid_access() verifier callback has been adjusted by the
      value of the variable offset.
      
      What is missing, however, is taking the variable offset into account
      when the context register is used, i.e. adding the offset to the
      value passed to the ->convert_ctx_access() callback.  This leads to
      the following eBPF user code:
      
           r1 += 68
           r0 = *(u32 *)(r1 + 8)
           exit
      
      being translated to this in kernel space:
      
         0: (07) r1 += 68
         1: (61) r0 = *(u32 *)(r1 +180)
         2: (95) exit
      
      Offset 8 corresponds to 180 in the kernel, but offset 76 is valid
      too.  The verifier will "accept" the access at offset 68+8=76 but
      then "convert" it as if it were at offset 8, yielding 180.  The
      effective access, at offset 68+180=248, is beyond the kernel
      context.  (This is a __sk_buff example on a debug-heavy kernel -
      packet mark is 8 -> 180; 76 would be data.)
      
      Dereferencing the modified context pointer is not as easy
      as dereferencing other types, because we have to translate
      the access to reading a field in kernel structures which is
      usually at a different offset and often of a different size.
      To allow modifying the pointer we would have to make sure
      that given eBPF instruction will always access the same
      field or the fields accessed are "compatible" in terms of
      offset and size...
      
      Disallow dereferencing modified context pointers and add the test
      case described here to the selftests.
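
      A sketch of the verifier-side check (simplified from the upstream
      check_mem_access(); the verbose() diagnostics are trimmed):

        if (reg->type == PTR_TO_CTX) {
        	/*
        	 * ctx accesses must be at a known constant offset so the
        	 * correct field conversion can be applied.
        	 */
        	if (!tnum_is_const(reg->var_off)) {
        		verbose("variable ctx access off=%d size=%d\n", off, size);
        		return -EACCES;
        	}
        	off += reg->var_off.value;	/* fold the known offset in */
        }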
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28e33f9d
  14. 14 Oct 2017, 1 commit
  15. 13 Oct 2017, 2 commits
    • genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack() · 0d08af35
      Doug Berger committed
      Any usage of the irq_gc_mask_disable_reg_and_ack() function has
      been replaced with the desired functionality.
      
      The incorrect and ambiguously named function is removed here to
      prevent accidental misuse.
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      0d08af35
    • genirq: generic chip: Add irq_gc_mask_disable_and_ack_set() · 20608924
      Doug Berger committed
      The irq_gc_mask_disable_reg_and_ack() function name implies that it
      provides the combined functions of irq_gc_mask_disable_reg() and
      irq_gc_ack().  However, the implementation does not actually do
      that, since it writes the mask instead of the disable register. It
      also does not maintain the mask cache, which makes it inappropriate
      to use with other masking functions.
      
      In addition, commit 659fb32d ("genirq: replace irq_gc_ack() with
      {set,clr}_bit variants (fwd)") effectively renamed irq_gc_ack() to
      irq_gc_ack_set_bit() so this function probably should have also been
      renamed at that time.
      
      The generic chip code currently provides three functions for use
      with the irq_mask member of the irq_chip structure and two functions
      for use with the irq_ack member of the irq_chip structure. These
      functions could be combined into six functions for use with the
      irq_mask_ack member of the irq_chip structure.  However, since only
      one of the combinations is currently used, only the function
      irq_gc_mask_disable_and_ack_set() is added by this commit.
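
      A sketch of the new combined handler (matching the shape of the
      upstream kernel/irq/generic-chip.c; treat as illustrative):

        void irq_gc_mask_disable_and_ack_set(struct irq_data *d)
        {
        	struct irq_chip_generic *gc = irq_data_get_irq_chip_data(d);
        	struct irq_chip_type *ct = irq_data_get_chip_type(d);
        	u32 mask = d->mask;

        	irq_gc_lock(gc);
        	irq_reg_writel(gc, mask, ct->regs.disable);
        	*ct->mask_cache &= ~mask;	/* keep the mask cache in sync */
        	irq_reg_writel(gc, mask, ct->regs.ack);
        	irq_gc_unlock(gc);
        }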
      
      The '_reg' and '_bit' portions of the base function names were left
      out of the new combined function name in an attempt to keep the name
      length manageable within the 80-character source line limit, while
      still allowing the distinct aspects of each combination to be
      captured by the name.
      
      If other combinations are desired in the future, please add them to
      the irq generic chip library at that time.
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      20608924