1. 09 Aug 2017, 1 commit
• s390/vmcp: make use of contiguous memory allocator · 3f429842
Authored by Heiko Carstens
If memory is fragmented it is unlikely that large order memory
allocations succeed. This has been an issue with the vmcp device
driver for a long time, since it requires large physical contiguous
memory areas for large responses.
      
To hopefully resolve this issue, make use of the contiguous memory
allocator (cma). This patch adds a vmcp-specific cma area with a
default size of 4MB. The size can be changed either via the
VMCP_CMA_SIZE config option at compile time or with the "vmcp_cma"
kernel parameter (e.g. "vmcp_cma=16m").
      
For any vmcp response buffer larger than 16k, memory is allocated
from the cma area. If such an allocation fails, there is a fallback
to the buddy allocator.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      3f429842
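A minimal sketch of the scheme described above: reserve a dedicated cma area at boot, then satisfy large responses from it with a buddy-allocator fallback. It assumes the cma_declare_contiguous()/cma_alloc() API of this era; the helper names and the order-2 cutoff follow the commit message, not the verbatim driver code.

```c
#include <linux/cma.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/pfn.h>

static struct cma *vmcp_cma;
static unsigned long vmcp_cma_size = CONFIG_VMCP_CMA_SIZE * 1024 * 1024;

/* "vmcp_cma=16m" on the command line overrides the compile-time default */
static int __init early_parse_vmcp_cma(char *p)
{
	vmcp_cma_size = ALIGN(memparse(p, NULL), PAGE_SIZE);
	return 0;
}
early_param("vmcp_cma", early_parse_vmcp_cma);

static void *vmcp_response_alloc(size_t bufsize)
{
	int order = get_order(bufsize);
	struct page *page = NULL;

	/* responses larger than 16k (order > 2) come from the cma area */
	if (order > 2)
		page = cma_alloc(vmcp_cma, PFN_UP(bufsize), 0, GFP_KERNEL);
	if (page)
		return page_address(page);
	/* fallback to the buddy allocator */
	return (void *)__get_free_pages(GFP_KERNEL, order);
}
```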
2. 07 Jul 2017, 2 commits
• mm, memory_hotplug: drop CONFIG_MOVABLE_NODE · f70029bb
Authored by Michal Hocko
Commit 20b2f52b ("numa: add CONFIG_MOVABLE_NODE for
movable-dedicated node") introduced CONFIG_MOVABLE_NODE without a
good explanation of why it is actually useful.

It makes a lot of sense to make the movable-node semantics opt-in, but we
already have that because the feature has to be explicitly enabled on
the kernel command line.  A config option on top only makes the
configuration space larger without a good reason.  It also adds
ifdefery that pollutes the code.

Just drop the config option and make it de facto always enabled.  This
shouldn't introduce any change to the semantics.
      
Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f70029bb
• mm: allow slab_nomerge to be set at build time · 7660a6fd
Authored by Kees Cook
Some hardened environments want to build kernels with slab_nomerge
already set (so that they do not depend on remembering to set the kernel
command line option).  This reduces the risk of kernel heap overflows
overwriting objects in merged caches and changes the requirements for
cache layout control, increasing the difficulty of these attacks.  By
keeping caches unmerged, these kinds of exploits can usually only damage
objects in the same cache (though the risk of metadata exploitation is
unchanged).
      
Link: http://lkml.kernel.org/r/20170620230911.GA25238@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Daniel Mack <daniel@zonque.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7660a6fd
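As a sketch, the build-time default plausibly reduces to a Kconfig-driven initializer plus the existing __setup hook; CONFIG_SLAB_MERGE_DEFAULT is the option the commit introduces, but the surrounding code here is abbreviated, not verbatim.

```c
#include <linux/init.h>
#include <linux/kernel.h>

/* merging stays the default unless the hardened config flips it off */
static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT);

/* the pre-existing command line override still works on top */
static int __init setup_slab_nomerge(char *str)
{
	slab_nomerge = true;
	return 1;
}
__setup("slab_nomerge", setup_slab_nomerge);
```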
3. 23 Jun 2017, 1 commit
4. 22 Jun 2017, 2 commits
• ima: define a set of appraisal rules requiring file signatures · 503ceaef
Authored by Mimi Zohar
The builtin "ima_appraise_tcb" policy should require file signatures for
at least a few of the hooks (e.g. kernel modules, firmware, and the kexec
kernel image), but changing it would break the existing userspace/kernel
ABI.

This patch defines a new builtin policy named "secure_boot", which
can be specified on the "ima_policy=" boot command line, independently
or in conjunction with the "ima_appraise_tcb" policy, by specifying
ima_policy="appraise_tcb | secure_boot".  The new appraisal rules
requiring file signatures will be added prior to the "ima_appraise_tcb"
rules.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      
      Changelog:
      - Reference secure boot in the new builtin policy name. (Thiago Bauermann)
      503ceaef
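The added rules plausibly take the following shape, assuming ima_policy.c's rule-entry layout; this is an abbreviated sketch, not the verbatim table.

```c
/* appraisal rules requiring signatures for the hooks named above */
static struct ima_rule_entry secure_boot_rules[] __ro_after_init = {
	{.action = APPRAISE, .func = MODULE_CHECK,
	 .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED},
	{.action = APPRAISE, .func = FIRMWARE_CHECK,
	 .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED},
	{.action = APPRAISE, .func = KEXEC_KERNEL_CHECK,
	 .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED},
	{.action = APPRAISE, .func = POLICY_CHECK,
	 .flags = IMA_FUNC | IMA_DIGSIG_REQUIRED},
};
```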
• ima: extend the "ima_policy" boot command line to support multiple policies · 33ce9549
Authored by Mimi Zohar
Add support for providing multiple builtin policies on the "ima_policy="
boot command line.  Use "|" as the delimiter separating the policy names.
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      33ce9549
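A sketch of the "|"-delimited parsing using strsep(); the policy names handled and the variables they set are illustrative.

```c
#include <linux/init.h>
#include <linux/string.h>

static int ima_policy __initdata;            /* illustrative */
static bool ima_use_secure_boot __initdata;  /* illustrative */

static int __init policy_setup(char *str)
{
	char *p;

	/* accepts e.g. ima_policy="appraise_tcb | secure_boot" */
	while ((p = strsep(&str, " |\n")) != NULL) {
		if (*p == '\0')
			continue;
		if (strcmp(p, "appraise_tcb") == 0)
			ima_policy = 1;
		else if (strcmp(p, "secure_boot") == 0)
			ima_use_secure_boot = true;
	}
	return 1;
}
__setup("ima_policy=", policy_setup);
```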
5. 20 Jun 2017, 1 commit
6. 19 Jun 2017, 1 commit
• mm: larger stack guard gap, between vmas · 1be7107f
Authored by Hugh Dickins
The stack guard page is a useful feature to reduce the risk of the stack
smashing into a different mapping. We have been using a single-page gap,
which is sufficient to prevent having the stack adjacent to a different
mapping. But this seems to be insufficient in light of the stack usage in
userspace: glibc uses alloca() allocations as large as 64kB in many
commonly used functions, and others use constructs like
gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings with
MAX_ARG_STRLEN.
      
This is especially dangerous for suid binaries with the default
unlimited stack size, because such applications can be tricked into
consuming a large portion of the stack, and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunately.
      
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
PAGE_SIZE units), which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somewhat inherent, but it should reduce the attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
Implementation-wise, first delete all the old code for the stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
Instead of keeping the gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap, mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be7107f
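The two helpers are small; this sketch stays close to the patch, with stack_guard_gap defaulting to 256 pages (1MB with 4k pages) and overridable via stack_guard_gap= on the command line.

```c
#include <linux/mm.h>
#include <linux/mm_types.h>

unsigned long stack_guard_gap = 256UL << PAGE_SHIFT;

static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
	unsigned long vm_start = vma->vm_start;

	if (vma->vm_flags & VM_GROWSDOWN) {
		vm_start -= stack_guard_gap;
		if (vm_start > vma->vm_start)	/* clamp on underflow */
			vm_start = 0;
	}
	return vm_start;
}

static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
{
	unsigned long vm_end = vma->vm_end;

	if (vma->vm_flags & VM_GROWSUP) {
		vm_end += stack_guard_gap;
		if (vm_end < vma->vm_end)	/* clamp on overflow */
			vm_end = -PAGE_SIZE;
	}
	return vm_end;
}
```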
7. 15 Jun 2017, 3 commits
8. 09 Jun 2017, 3 commits
• doc: Add coresight_cpu_debug.enable to kernel-parameters.txt · 62a31ce1
Authored by Leo Yan
Add coresight_cpu_debug.enable to kernel-parameters.txt; this flag is
used to enable/disable the CPU sampling-based debugging.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      62a31ce1
• rcu: Remove *_SLOW_* Kconfig options · 90040c9e
Authored by Paul E. McKenney
The RCU_TORTURE_TEST_SLOW_PREINIT,
RCU_TORTURE_TEST_SLOW_PREINIT_DELAY, RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead.  The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.
      
      This commit therefore simplifies Tree RCU a bit by removing the Kconfig
      options and adding the corresponding kernel parameters to rcutorture's
      .boot files instead.  However, this commit also leaves out the kernel
      parameters for TREE02, TREE04, and TREE07 in order to have about the
      same number of tests slowed as not slowed.  TREE01, TREE03, TREE05,
      and TREE06 are slowed, and the rest are not slowed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      90040c9e
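So a slowed scenario's .boot file presumably gains lines like these (the parameter names are the ones cited above; the delay values here are illustrative):

```
rcutree.gp_preinit_delay=3
rcutree.gp_init_delay=3
rcutree.gp_cleanup_delay=3
```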
• srcu: Prevent sdp->srcu_gp_seq_needed counter wrap · c350c008
Authored by Paul E. McKenney
If a given CPU never happens to start an SRCU grace period, the
grace-period sequence counter might wrap.  If this CPU were to decide to
finally start a grace period, the state of its sdp->srcu_gp_seq_needed
might make it appear that it has already requested this grace period,
which would prevent starting the grace period.  If no other CPU ever
started a grace period again, this would look like a grace-period hang.
Even if some other CPU took pity and started the needed grace period, the
leaf srcu_node structure's ->srcu_data_have_cbs field won't have a record
of the fact that this CPU has a callback pending, which would look like
a very localized grace-period hang.
      
      This might seem very unlikely, but SRCU grace periods can take less than
      a microsecond on small systems, which means that overflow can happen
      in much less than an hour on a 32-bit embedded system.  And embedded
      systems are especially likely to have long-term idle CPUs.  Therefore,
      it makes sense to prevent this scenario from happening.
      
      This commit therefore scans each srcu_data structure occasionally,
      with frequency controlled by the srcutree.counter_wrap_check kernel
      boot parameter.  This parameter can be set to something like 255
      in order to exercise the counter-wrap-prevention code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c350c008
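A sketch of the occasional scan, assuming it runs at grace-period end; ULONG_CMP_GE() is the kernel's wrap-safe comparison, and the slack of 100 is illustrative.

```c
/* scan only once per counter_wrap_check grace periods */
if (!(gpseq & counter_wrap_check))
	for_each_possible_cpu(cpu) {
		sdp = per_cpu_ptr(sp->sda, cpu);
		spin_lock_irqsave(&sdp->lock, flags);
		/* pull ancient ->srcu_gp_seq_needed values forward
		 * before the sequence counter can wrap past them */
		if (ULONG_CMP_GE(gpseq, sdp->srcu_gp_seq_needed + 100))
			sdp->srcu_gp_seq_needed = gpseq;
		spin_unlock_irqrestore(&sdp->lock, flags);
	}
```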
9. 08 Jun 2017, 2 commits
• rcuperf: Add writer_holdoff boot parameter · 820687a7
Authored by Paul E. McKenney
This commit adds a writer_holdoff boot parameter to rcuperf, which is
intended to be used to test Tree SRCU's auto-expediting.  This
boot parameter is in microseconds, and defaults to zero (that is,
disabled).  Set it a bit larger than srcutree.exp_holdoff (minding
the nanosecond/microsecond conversion) to force Tree SRCU
to auto-expedite more aggressively.
      
      This commit also adds documentation for this parameter, and fixes some
      alphabetization while in the neighborhood.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      820687a7
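Inside rcuperf this plausibly reduces to a module parameter plus a delay in the writer kthread; a sketch, not verbatim code.

```c
#include <linux/delay.h>
#include <linux/torture.h>

torture_param(int, writer_holdoff, 0,
	      "Holdoff (us) between GPs, zero to disable");

/* ... in the writer kthread's loop, before starting the next GP: */
if (writer_holdoff)
	udelay(writer_holdoff);
```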
• rcuperf: Add ability to performance-test call_rcu() and friends · 881ed593
Authored by Paul E. McKenney
This commit upgrades rcuperf so that it can do performance testing on
asynchronous grace-period primitives such as call_srcu().  There is
a new rcuperf.gp_async module parameter that specifies this new behavior,
with the pre-existing rcuperf.gp_exp testing expedited grace periods such as
synchronize_rcu_expedited(), and with the default being to test synchronous
non-expedited grace periods such as synchronize_rcu().
      
      There is also a new rcuperf.gp_async_max module parameter that specifies
      the maximum number of outstanding callbacks per writer kthread, defaulting
      to 1,000.  When this limit is exceeded, the writer thread invokes the
      appropriate flavor of rcu_barrier() to wait for callbacks to drain.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Removed the redundant initialization noted by Arnd Bergmann. ]
      881ed593
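A sketch of the async path; cur_ops->async and cur_ops->gp_barrier stand in for the flavor's call_rcu()-style and rcu_barrier()-style operations, and the names here are illustrative.

```c
torture_param(bool, gp_async, false, "Use asynchronous GP primitives");
torture_param(int, gp_async_max, 1000, "Max outstanding callbacks per writer");

static atomic_t n_async_inflight;

/* ... in the writer loop: */
if (gp_async) {
	atomic_inc(&n_async_inflight);
	cur_ops->async(&rhp->rh, rcu_perf_async_cb);	/* e.g. call_srcu() */
	if (atomic_read(&n_async_inflight) >= gp_async_max)
		cur_ops->gp_barrier();	/* drain, e.g. srcu_barrier() */
}
```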
10. 01 Jun 2017, 1 commit
11. 24 May 2017, 1 commit
12. 10 May 2017, 1 commit
13. 01 May 2017, 1 commit
14. 27 Apr 2017, 2 commits
• srcu: Specify auto-expedite holdoff time · 22607d66
Authored by Paul E. McKenney
      On small systems, in the absence of readers, expedited SRCU grace
      periods can complete in less than a microsecond.  This means that an
      eight-CPU system can have all CPUs doing synchronize_srcu() in a tight
      loop and almost always expedite.  This might actually be desirable in
      some situations, but in general it is a good way to needlessly burn
      CPU cycles.  And in those situations where it is desirable, your friend
      is the function synchronize_srcu_expedited().
      
      For other situations, this commit adds a kernel parameter that specifies
      a holdoff between completing the last SRCU grace period and auto-expediting
      the next.  If the next grace period starts before the holdoff expires,
      auto-expediting is disabled.  The holdoff is 50 microseconds by default,
      and can be tuned to the desired number of nanoseconds.  A value of zero
      disables auto-expediting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
      22607d66
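The decision reduces to a timestamp comparison; a minimal self-contained sketch, with the 50-microsecond default taken from the text and all names illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t exp_holdoff_ns = 50 * 1000;	/* srcutree.exp_holdoff */
static uint64_t srcu_last_gp_end_ns;		/* set when a GP completes */

static bool may_auto_expedite(uint64_t now_ns)
{
	if (exp_holdoff_ns == 0)
		return false;	/* zero disables auto-expediting */
	/* expedite only if SRCU has been idle for at least one holdoff */
	return now_ns - srcu_last_gp_end_ns >= exp_holdoff_ns;
}
```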
• x86, iommu/vt-d: Add an option to disable Intel IOMMU force on · bfd20f1c
Authored by Shaohua Li
The IOMMU harms performance significantly when we run very fast
networking workloads; this was a 40Gb networking XDP test. The software
overhead is almost negligible, but it is the IOTLB miss (based on our
analysis) which kills the performance. We observed the same performance
issue even with software passthrough (identity mapping); only the
hardware passthrough survives. The pps with iommu (with software
passthrough) is only about ~30% of that without it. This is a limitation
in hardware based on our observation, so we'd like to disable the IOMMU
force on, but we do want to use TBOOT and we can sacrifice the DMA
security brought by the IOMMU. I must admit I know nothing about TBOOT,
but the TBOOT guys (cc-ed) think not enabling the IOMMU is totally ok.
      
So introduce a new boot option to disable the force on. It's kind of
silly that we need to run into intel_iommu_init even without force on, but
we need to disable the TBOOT PMR registers. For systems without the boot
option, nothing is changed.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
      bfd20f1c
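The option plausibly hangs off the existing intel_iommu= parser; a sketch (the flag name follows the commit, the parsing loop is abbreviated).

```c
#include <linux/init.h>
#include <linux/printk.h>
#include <linux/string.h>

int intel_iommu_tboot_noforce;

static int __init intel_iommu_setup(char *str)
{
	/* handles e.g. intel_iommu=tboot_noforce on the command line */
	while (str && *str) {
		if (!strncmp(str, "tboot_noforce", 13)) {
			pr_info("Intel-IOMMU: not forcing on after tboot\n");
			intel_iommu_tboot_noforce = 1;
		}
		str += strcspn(str, ",");
		while (*str == ',')
			str++;
	}
	return 0;
}
__setup("intel_iommu=", intel_iommu_setup);
```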
15. 21 Apr 2017, 1 commit
16. 06 Apr 2017, 1 commit
• iommu: Allow default domain type to be set on the kernel command line · fccb4e3b
Authored by Will Deacon
      The IOMMU core currently initialises the default domain for each group
      to IOMMU_DOMAIN_DMA, under the assumption that devices will use
      IOMMU-backed DMA ops by default. However, in some cases it is desirable
      for the DMA ops to bypass the IOMMU for performance reasons, reserving
      use of translation for subsystems such as VFIO that require it for
      enforcing device isolation.
      
      Rather than modify each IOMMU driver to provide different semantics for
      DMA domains, instead we introduce a command line parameter that can be
      used to change the type of the default domain. Passthrough can then be
      specified using "iommu.passthrough=1" on the kernel command line.
Signed-off-by: Will Deacon <will.deacon@arm.com>
      fccb4e3b
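A sketch of the early parameter, assuming the strtobool() helper of this era; close to the commit's shape but abbreviated.

```c
#include <linux/init.h>
#include <linux/iommu.h>
#include <linux/string.h>

static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;

static int __init iommu_set_def_domain_type(char *str)
{
	bool pt;

	if (!str || strtobool(str, &pt))
		return -EINVAL;

	/* iommu.passthrough=1 selects identity (untranslated) domains */
	iommu_def_domain_type = pt ? IOMMU_DOMAIN_IDENTITY : IOMMU_DOMAIN_DMA;
	return 0;
}
early_param("iommu.passthrough", iommu_set_def_domain_type);
```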
17. 01 Apr 2017, 1 commit
18. 28 Mar 2017, 1 commit
• RAS: Add a Corrected Errors Collector · 011d8261
Authored by Borislav Petkov
      Introduce a simple data structure for collecting correctable errors
      along with accessors. More detailed description in the code itself.
      
Error decoding is now done through the decoding chain:
mce_first_notifier() gets to see the error first, and the CEC decides
whether to log it, in which case the rest of the chain doesn't hear
about it (basically the main reason for the CE collector), or to
continue running the notifiers.

When the CEC hits the action threshold, it will try to soft-offline the
page containing the ECC error and then the whole decoding chain gets to
see the error.
Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170327093304.10683-5-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      011d8261
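In spirit the collector is a small PFN-indexed table of counters; a self-contained sketch with illustrative names and threshold (the real CEC also ages, decays, and evicts entries).

```c
#include <stdint.h>

struct ce_elem {
	uint64_t pfn;		/* page frame that took the corrected error */
	unsigned int count;	/* how many times we have seen it */
};

#define CEC_NELEMS	 512
#define ACTION_THRESHOLD 3	/* illustrative; tunable in the real CEC */

static struct ce_elem ce_arr[CEC_NELEMS];
static unsigned int ce_n;

/* Returns nonzero once pfn crosses the action threshold, i.e. when the
 * page should be soft-offlined and the rest of the chain should run. */
int cec_add_elem(uint64_t pfn)
{
	unsigned int i;

	for (i = 0; i < ce_n; i++)
		if (ce_arr[i].pfn == pfn)
			return ++ce_arr[i].count >= ACTION_THRESHOLD;

	if (ce_n < CEC_NELEMS) {
		ce_arr[ce_n].pfn = pfn;
		ce_arr[ce_n].count = 1;
		ce_n++;
	}
	return 0;
}
```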
19. 21 Mar 2017, 1 commit
20. 06 Mar 2017, 2 commits
21. 03 Mar 2017, 1 commit
22. 23 Feb 2017, 1 commit
• slub: make sysfs directories for memcg sub-caches optional · 1663f26d
Authored by Tejun Heo
      SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
      bunch of debug files.  Usually, there aren't that many caches on a
      system and this doesn't really matter; however, if memcg is in use, each
      cache can have per-cgroup sub-caches.  SLUB creates the same directories
      for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.
      
      Unfortunately, because there can be a lot of cgroups, active or
      draining, the product of the numbers of caches, cgroups and files in
      each directory can reach a very high number - hundreds of thousands is
      commonplace.  Millions and beyond aren't difficult to reach either.
      
What's under /sys/kernel/slab is primarily for debugging, and the
information and control on a root cache already cover its
sub-caches.  While having a separate directory for each sub-cache can be
helpful for development, it doesn't make much sense to pay this amount
of overhead by default.
      
      This patch introduces a boot parameter slub_memcg_sysfs which determines
      whether to create sysfs directories for per-memcg sub-caches.  It also
      adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
      default value and defaults to 0.
      
      [akpm@linux-foundation.org: kset_unregister(NULL) is legal]
Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1663f26d
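A sketch of the parameter plumbing, close to the commit's shape; get_option() parses the integer value.

```c
#include <linux/init.h>
#include <linux/kernel.h>

static bool memcg_sysfs_enabled = IS_ENABLED(CONFIG_SLUB_MEMCG_SYSFS_ON);

static int __init setup_slub_memcg_sysfs(char *str)
{
	int v;

	if (get_option(&str, &v) > 0)
		memcg_sysfs_enabled = v;
	return 1;
}
__setup("slub_memcg_sysfs=", setup_slub_memcg_sysfs);

/* ...later, when creating a memcg sub-cache's sysfs directory: */
if (!memcg_sysfs_enabled)
	return 0;	/* skip /sys/kernel/slab/$CACHE/cgroup/... */
```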
23. 16 Feb 2017, 1 commit
• x86/platform/goldfish: Prevent unconditional loading · 47512cfd
Authored by Thomas Gleixner
      The goldfish platform code registers the platform device unconditionally
      which causes havoc in several ways if the goldfish_pdev_bus driver is
      enabled:
      
       - Access to the hardcoded physical memory region, which is either not
         available or contains stuff which is completely unrelated.
      
 - Prevents the serial port's interrupt from being requested

 - In case of a spurious interrupt it goes into an infinite loop in the
   interrupt handler of the pdev_bus driver (which needs to be fixed
   separately).
      
      Add a 'goldfish' command line option to make the registration opt-in when
      the platform is compiled in.
      
      I'm seriously grumpy about this engineering trainwreck, which has seven
      SOBs from Intel developers for 50 lines of code. And none of them figured
      out that this is broken. Impressive fail!
      
      Fixes: ddd70cf9 ("goldfish: platform device for x86")
Reported-by: Gabriel C <nix.or.die@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      47512cfd
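The opt-in is a plain flag set from the command line; a sketch close to the commit's shape, with the resource table elided.

```c
#include <linux/init.h>
#include <linux/platform_device.h>

static bool goldfish_enable __initdata;

static int __init goldfish_setup(char *str)
{
	goldfish_enable = true;
	return 0;
}
__setup("goldfish", goldfish_setup);

static int __init goldfish_init(void)
{
	if (!goldfish_enable)
		return -ENODEV;	/* no longer registered unconditionally */

	/* goldfish_pdev_bus_resources as defined elsewhere in the file */
	platform_device_register_simple("goldfish_pdev_bus", -1,
					goldfish_pdev_bus_resources, 2);
	return 0;
}
device_initcall(goldfish_init);
```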
24. 08 Feb 2017, 1 commit
25. 04 Feb 2017, 1 commit
26. 16 Jan 2017, 1 commit
27. 15 Jan 2017, 1 commit
28. 12 Jan 2017, 1 commit
29. 27 Dec 2016, 1 commit
30. 21 Dec 2016, 1 commit
31. 19 Dec 2016, 1 commit
• swiotlb: Add swiotlb=noforce debug option · fff5d992
Authored by Geert Uytterhoeven
      On architectures like arm64, swiotlb is tied intimately to the core
      architecture DMA support. In addition, ZONE_DMA cannot be disabled.
      
      To aid debugging and catch devices not supporting DMA to memory outside
      the 32-bit address space, add a kernel command line option
      "swiotlb=noforce", which disables the use of bounce buffers.
      If specified, trying to map memory that cannot be used with DMA will
      fail, and a rate-limited warning will be printed.
      
      Note that io_tlb_nslabs is set to 1, which is the minimal supported
      value.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      fff5d992
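A sketch of the option parsing, assuming swiotlb_force became an enum with a SWIOTLB_NO_FORCE value, as the commit describes; the size parsing is elided.

```c
#include <linux/init.h>
#include <linux/string.h>
#include <linux/swiotlb.h>

static int __init setup_io_tlb_npages(char *str)
{
	/* ... existing "swiotlb=<nslabs>" parsing elided ... */
	if (!strcmp(str, "force")) {
		swiotlb_force = SWIOTLB_FORCE;
	} else if (!strcmp(str, "noforce")) {
		swiotlb_force = SWIOTLB_NO_FORCE;
		io_tlb_nslabs = 1;	/* minimal supported value */
	}
	return 0;
}
early_param("swiotlb", setup_io_tlb_npages);
```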