1. 11 8月, 2017 1 次提交
  2. 10 8月, 2017 3 次提交
  3. 15 7月, 2017 1 次提交
    • J
      sparc64: Measure receiver forward progress to avoid send mondo timeout · 9d53caec
      Jane Chu 提交于
      A large sun4v SPARC system may have moments of intensive xcall activities,
      usually caused by unmapping many pages on many CPUs concurrently. This can
      flood receivers with CPU mondo interrupts for an extended period, causing
      some unlucky senders to hit send-mondo timeout. This problem gets worse
      as cpu count increases because sometimes mappings must be invalidated on
      all CPUs, and sometimes all CPUs may gang up on a single CPU.
      
      But a busy system is not a broken system. In the above scenario, as long
      as the receiver is making forward progress processing mondo interrupts,
      the sender should continue to retry.
      
      This patch implements the receiver's forward progress meter by introducing
      a per cpu counter 'cpu_mondo_counter[cpu]' where 'cpu' is in the range
      of 0..NR_CPUS. The receiver increments its counter as soon as it receives
      a mondo and the sender tracks the receiver's counter. If the receiver has
      stopped making forward progress when the retry limit is reached, the sender
      declares send-mondo-timeout and panic; otherwise, the receiver is allowed
      to keep making forward progress.
      
      In addition, it's been observed that PCIe hotplug events generate Correctable
      Errors that are handled by hypervisor and then OS. Hypervisor 'borrows'
      a guest cpu strand briefly to provide the service. If the cpu strand is
      simultaneously the only cpu targeted by a mondo, it may not be available
      for the mondo in 20msec, causing SUN4V mondo timeout. It appears that 1 second
      is the agreed wait time between hypervisor and guest OS, this patch makes
      the adjustment.
      
      Orabug: 25476541
      Orabug: 26417466
      Signed-off-by: NJane Chu <jane.chu@oracle.com>
      Reviewed-by: NSteve Sistare <steven.sistare@oracle.com>
      Reviewed-by: NAnthony Yznaga <anthony.yznaga@oracle.com>
      Reviewed-by: NRob Gardner <rob.gardner@oracle.com>
      Reviewed-by: NThomas Tai <thomas.tai@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d53caec
  4. 13 7月, 2017 2 次提交
    • M
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko 提交于
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
    • N
      kernel/watchdog: introduce arch_touch_nmi_watchdog() · f2e0cff8
      Nicholas Piggin 提交于
      For architectures that define HAVE_NMI_WATCHDOG, instead of having them
      provide the complete touch_nmi_watchdog() function, just have them
      provide arch_touch_nmi_watchdog().
      
      This gives the generic code more flexibility in implementing this
      function, and arch implementations don't miss out on touching the
      softlockup watchdog or other generic details.
      
      Link: http://lkml.kernel.org/r/20170616065715.18390-3-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: NDon Zickus <dzickus@redhat.com>
      Reviewed-by: NBabu Moger <babu.moger@oracle.com>
      Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2e0cff8
  5. 12 7月, 2017 1 次提交
  6. 03 7月, 2017 1 次提交
  7. 29 6月, 2017 1 次提交
  8. 28 6月, 2017 3 次提交
  9. 26 6月, 2017 14 次提交
  10. 19 6月, 2017 1 次提交
    • H
      mm: larger stack guard gap, between vmas · 1be7107f
      Hugh Dickins 提交于
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
      which is 256kB or stack strings with MAX_ARG_STRLEN.
      
      This will become especially dangerous for suid binaries and the default
      no limit for the stack size limit because those applications can be
      tricked to consume a large portion of the stack and a single glibc call
      could jump over the guard page. These attacks are not theoretical,
      unfortunatelly.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and and the vma tree's subtree_gap support for that.
      Original-patch-by: NOleg Nesterov <oleg@redhat.com>
      Original-patch-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  11. 15 6月, 2017 3 次提交
  12. 13 6月, 2017 8 次提交
  13. 11 6月, 2017 1 次提交