1. 07 Jul 2017, 10 commits
  2. 03 Jul 2017, 1 commit
  3. 01 Jul 2017, 1 commit
  4. 29 Jun 2017, 1 commit
    • block: provide bio_uninit() for freeing integrity/task associations · 9ae3b3f5
      Authored by Jens Axboe
      Wen reports significant memory leaks with DIF and O_DIRECT:
      
      "With nvme devive + T10 enabled, On a system it has 256GB and started
      logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
      it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
      leaking.
      
      /proc/meminfo | grep SUnreclaim...
      
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      ....
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      
      When testcases with T10 enabled call into __blkdev_direct_IO_simple(),
      the code doesn't free the memory allocated by bio_integrity_alloc().
      The patch fixes the issue. HTX has been run for 60+ hours without failure."
      
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      
      Fix this by providing a bio_uninit() and export it, so that we can use
      it to free this data. Note that this is a minimal fix for this issue.
      Any current user of bios that are allocated outside of
      bio_alloc_bioset() suffers from this issue, most notably some drivers.
      We will fix those in a more comprehensive patch for 4.13. This also
      means that the commit marked as being fixed by this isn't the real
      culprit, it's just the most obvious one out there.
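
      A minimal sketch of the resulting pattern (illustrative names, assuming
      the 4.12-era block API; this is not the exact fs/block_dev.c code):

          #include <linux/bio.h>

          static int stack_bio_read(struct block_device *bdev, struct page *page)
          {
                  struct bio bio;         /* on the stack: bio_put() never runs */
                  struct bio_vec bvec;
                  int ret;

                  bio_init(&bio, &bvec, 1);
                  bio.bi_bdev = bdev;
                  bio.bi_opf = REQ_OP_READ;
                  bio_add_page(&bio, page, PAGE_SIZE, 0);

                  ret = submit_bio_wait(&bio);

                  /* the regular bio free path is skipped for a stack bio, so
                   * drop the integrity/task associations by hand */
                  bio_uninit(&bio);
                  return ret;
          }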
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9ae3b3f5
  5. 24 Jun 2017, 1 commit
    • slub: make sysfs file removal asynchronous · 3b7b3140
      Authored by Tejun Heo
      Commit bf5eb3de ("slub: separate out sysfs_slab_release() from
      sysfs_slab_remove()") made slub sysfs file removals synchronous to
      kmem_cache shutdown.
      
      Unfortunately, this created a possible ABBA deadlock between slab_mutex
      and sysfs draining mechanism triggering the following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        4.10.0-test+ #48 Not tainted
        -------------------------------------------------------
        rmmod/1211 is trying to acquire lock:
         (s_active#120){++++.+}, at: [<ffffffff81308073>] kernfs_remove+0x23/0x40
      
        but task is already holding lock:
         (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (slab_mutex){+.+.+.}:
      	 lock_acquire+0xf6/0x1f0
      	 __mutex_lock+0x75/0x950
      	 mutex_lock_nested+0x1b/0x20
      	 slab_attr_store+0x75/0xd0
      	 sysfs_kf_write+0x45/0x60
      	 kernfs_fop_write+0x13c/0x1c0
      	 __vfs_write+0x28/0x120
      	 vfs_write+0xc8/0x1e0
      	 SyS_write+0x49/0xa0
      	 entry_SYSCALL_64_fastpath+0x1f/0xc2
      
        -> #0 (s_active#120){++++.+}:
      	 __lock_acquire+0x10ed/0x1260
      	 lock_acquire+0xf6/0x1f0
      	 __kernfs_remove+0x254/0x320
      	 kernfs_remove+0x23/0x40
      	 sysfs_remove_dir+0x51/0x80
      	 kobject_del+0x18/0x50
      	 __kmem_cache_shutdown+0x3e6/0x460
      	 kmem_cache_destroy+0x1fb/0x2d0
      	 kvm_exit+0x2d/0x80 [kvm]
      	 vmx_exit+0x19/0xa1b [kvm_intel]
      	 SyS_delete_module+0x198/0x1f0
      	 entry_SYSCALL_64_fastpath+0x1f/0xc2
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(slab_mutex);
      				 lock(s_active#120);
      				 lock(slab_mutex);
          lock(s_active#120);
      
         *** DEADLOCK ***
      
        2 locks held by rmmod/1211:
         #0:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a7877>] get_online_cpus+0x37/0x80
         #1:  (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0
      
        stack backtrace:
        CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48
        Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
        Call Trace:
         print_circular_bug+0x1be/0x210
         __lock_acquire+0x10ed/0x1260
         lock_acquire+0xf6/0x1f0
         __kernfs_remove+0x254/0x320
         kernfs_remove+0x23/0x40
         sysfs_remove_dir+0x51/0x80
         kobject_del+0x18/0x50
         __kmem_cache_shutdown+0x3e6/0x460
         kmem_cache_destroy+0x1fb/0x2d0
         kvm_exit+0x2d/0x80 [kvm]
         vmx_exit+0x19/0xa1b [kvm_intel]
         SyS_delete_module+0x198/0x1f0
         ? SyS_delete_module+0x5/0x1f0
         entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      It'd be cleanest to deal with the issue by removing sysfs files
      without holding slab_mutex before the rest of shutdown; however, given
      the current code structure, it is pretty difficult to do so.

      This patch punts sysfs file removal to a work item.  Before commit
      bf5eb3de, the removal was punted to an RCU delayed work item which is
      executed after release.  Now, we're punting to a different work item on
      shutdown, which still maintains the goal of removing the sysfs files
      earlier when destroying kmem_caches.
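
      A hedged sketch of the shape of the change (field and function names
      approximate): the removal work runs without slab_mutex held, which
      breaks the ABBA dependency with the kernfs s_active draining:

          /* assumes a struct work_struct kobj_remove_work added to kmem_cache */
          static void sysfs_slab_remove_workfn(struct work_struct *work)
          {
                  struct kmem_cache *s = container_of(work, struct kmem_cache,
                                                      kobj_remove_work);

                  /* slab_mutex is not held here, so kernfs draining is safe */
                  kobject_del(&s->kobj);
                  kobject_put(&s->kobj);
          }

          static void sysfs_slab_remove(struct kmem_cache *s)
          {
                  /* called with slab_mutex held: punt the kernfs removal */
                  schedule_work(&s->kobj_remove_work);
          }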
      
      Link: http://lkml.kernel.org/r/20170620204512.GI21326@htj.duckdns.org
      Fixes: bf5eb3de ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b7b3140
  6. 22 Jun 2017, 1 commit
  7. 20 Jun 2017, 2 commits
    • time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting · 3d88d56c
      Authored by John Stultz
      Due to how the MONOTONIC_RAW accumulation logic was handled,
      there is the potential for a 1ns discontinuity when we do
      accumulations. This small discontinuity has for the most part
      gone unnoticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
      in their vDSO clock_gettime implementation, we've seen failures
      with the inconsistency-check test in kselftest.
      
      This patch addresses the issue by using the same sub-ns
      accumulation handling that CLOCK_MONOTONIC uses, which avoids
      the issue for in-kernel users.
      
      Since the ARM64 vDSO implementation has its own clock_gettime
      calculation logic, this patch reduces the frequency of errors,
      but failures are still seen. The ARM64 vDSO will need to be
      updated to include the sub-nanosecond xtime_nsec values in its
      calculation for this issue to be completely fixed.
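
      An illustrative (not the exact kernel) view of the sub-ns handling:
      nanoseconds are accumulated in a (ns << shift) fixed-point value, so
      the fractional remainder survives each accumulation step instead of
      being truncated away, which is what produced the 1ns step:

          /* xtime_nsec holds nanoseconds << shift; the low bits carry the
           * fractional remainder between accumulation intervals */
          static u64 accumulate_raw(u64 *xtime_nsec, u32 shift, u64 delta)
          {
                  *xtime_nsec += delta;           /* delta already << shift */
                  return *xtime_nsec >> shift;    /* whole ns for readers */
          }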
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Tested-by: Daniel Mentz <danielmentz@google.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "stable #4 . 8+" <stable@vger.kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3d88d56c
    • time: Fix clock->read(clock) race around clocksource changes · ceea5e37
      Authored by John Stultz
      In tests that exercise switching of clocksources, a NULL
      pointer dereference can be observed on ARM64 platforms in the
      clocksource read() function:
      
      u64 clocksource_mmio_readl_down(struct clocksource *c)
      {
      	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
      }
      
      This is called from the core timekeeping code via:
      
      	cycle_now = tkr->read(tkr->clock);
      
      tkr->read is the cached tkr->clock->read() function pointer.
      When the clocksource is changed then tkr->clock and tkr->read
      are updated sequentially. The code above results in a sequential
      load operation of tkr->read and tkr->clock as well.
      
      If the store to tkr->clock hits between the loads of tkr->read
      and tkr->clock, then the old read() function is called with the
      new clock pointer. As a consequence the read() function
      dereferences a different data structure and the resulting 'reg'
      pointer can point anywhere including NULL.
      
      This problem was introduced when the timekeeping code was
      switched over to use struct tk_read_base. Before that, it was
      theoretically possible as well when the compiler decided to
      reload clock in the code sequence:
      
           now = tk->clock->read(tk->clock);
      
      Add a helper function which avoids the issue by reading
      tk_read_base->clock once into a local variable clk and then issuing
      the read function via clk->read(clk). This guarantees that the
      read() function always gets the proper clocksource pointer handed
      in.
      
      Since there is now no use for the tkr.read pointer, this patch
      also removes it, and to address stopping the fast timekeeper
      during suspend/resume, it introduces a dummy clocksource to use
      rather than just a dummy read function.
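
      The helper has roughly this shape (hedged sketch): a single read of
      tkr->clock pins one clocksource for both the pointer and the call:

          static inline u64 tk_clock_read(struct tk_read_base *tkr)
          {
                  struct clocksource *clock = READ_ONCE(tkr->clock);

                  return clock->read(clock);
          }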
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: stable <stable@vger.kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ceea5e37
  8. 19 Jun 2017, 1 commit
    • mm: larger stack guard gap, between vmas · 1be7107f
      Authored by Hugh Dickins
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
      which is 256kB, or stack strings with MAX_ARG_STRLEN.
      
      This will become especially dangerous for suid binaries and the default
      unlimited stack size limit, because those applications can be
      tricked to consume a large portion of the stack and a single glibc call
      could jump over the guard page. These attacks are not theoretical,
      unfortunately.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somewhat inherent, but it should reduce the attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation-wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and the vma tree's subtree_gap support for that.
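
      The core of the new accounting is a helper along these lines (hedged
      sketch close to the patch; stack_guard_gap is the new command-line
      tunable, in bytes):

          static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
          {
                  unsigned long vm_start = vma->vm_start;

                  if (vma->vm_flags & VM_GROWSDOWN) {
                          vm_start -= stack_guard_gap;
                          if (vm_start > vma->vm_start)   /* clamp underflow */
                                  vm_start = 0;
                  }
                  return vm_start;
          }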
      Original-patch-by: Oleg Nesterov <oleg@redhat.com>
      Original-patch-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  9. 15 Jun 2017, 3 commits
  10. 12 Jun 2017, 2 commits
  11. 09 Jun 2017, 1 commit
    • KEYS: sanitize key structs before freeing · 0620fddb
      Authored by Eric Biggers
      While a 'struct key' itself normally does not contain sensitive
      information, Documentation/security/keys.txt actually encourages this:
      
           "Having a payload is not required; and the payload can, in fact,
           just be a value stored in the struct key itself."
      
      In case someone has taken this advice, or will take this advice in the
      future, zero the key structure before freeing it.  We might as well, and
      as a bonus this could make it a bit more difficult for an adversary to
      determine which keys have recently been in use.
      
      This is safe because the key_jar cache does not use a constructor.
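
      The change amounts to zeroing the object on its way back to the slab,
      roughly (hedged sketch):

          /* key_jar has no constructor, so poisoning the object is safe */
          memzero_explicit(key, sizeof(*key));
          kmem_cache_free(key_jar, key);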
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: James Morris <james.l.morris@oracle.com>
      0620fddb
  12. 08 Jun 2017, 3 commits
    • srcu: Allow use of Classic SRCU from both process and interrupt context · 1123a604
      Authored by Paolo Bonzini
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      that only has users in irq context, and you can mix process and irq
      context as long as process context users disable interrupts.  In
      addition, __srcu_read_unlock() actually uses this_cpu_dec() on both Tree
      SRCU and Classic SRCU.  For those two implementations, only
      srcu_read_lock() is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
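
      In sketch form, the irq-safe read lock looks like this (hedged; field
      names follow the 4.12-era Classic SRCU layout):

          int __srcu_read_lock(struct srcu_struct *sp)
          {
                  int idx;

                  idx = READ_ONCE(sp->completed) & 0x1;
                  /* this_cpu_inc() is irq-safe (a single instruction on most
                   * architectures), unlike the previous __this_cpu_inc() */
                  this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
                  smp_mb(); /* keep the critical section inside the counters */
                  return idx;
          }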
      
      Cc: stable@vger.kernel.org
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: Linu Cherian <linuc.decode@gmail.com>
      Suggested-by: Linu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1123a604
    • net: ipv6: Release route when device is unregistering · 8397ed36
      Authored by David Ahern
      Roopa reported that attempts to delete a bond device that is referenced
      in a multipath route hang:
      
      $ ifdown bond2    # ifupdown2 command that deletes virtual devices
      unregister_netdevice: waiting for bond2 to become free. Usage count = 2
      
      Steps to reproduce:
          echo 1 > /proc/sys/net/ipv6/conf/all/ignore_routes_with_linkdown
          ip link add dev bond12 type bond
          ip link add dev bond13 type bond
          ip addr add 2001:db8:2::0/64 dev bond12
          ip addr add 2001:db8:3::0/64 dev bond13
          ip route add 2001:db8:33::0/64 nexthop via 2001:db8:2::2 nexthop via 2001:db8:3::2
          ip link del dev bond12
          ip link del dev bond13
      
      The root cause is the recent change to keep routes on a linkdown. Update
      the check to detect when the device is unregistering and release the
      route for that case.
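
      A hedged sketch of the adjusted test in the fib6 teardown walker
      (names approximate): the route is also released when its device is
      unregistering, regardless of the linkdown policy:

          if (rt->dst.dev == dev &&
              (dev->reg_state == NETREG_UNREGISTERING ||
               !rt->rt6i_idev->cnf.ignore_routes_with_linkdown))
                  return -1;      /* drop the route */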
      
      Fixes: a1a22c12 ("net: ipv6: Keep nexthop of multipath route on admin down")
      Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8397ed36
    • net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      Authored by David S. Miller
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice(),
      because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call netdev_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      netdev_ops->ndo_init() succeeds, by invoking both netdev_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
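
      A hedged sketch of what a driver conversion looks like under the new
      scheme (struct, field, and function names here are illustrative):

          static void example_priv_destructor(struct net_device *dev)
          {
                  struct example_priv *p = netdev_priv(dev);

                  /* private resources only; no free_netdev() here */
                  free_percpu(p->pcpu_stats);
          }

          static void example_setup(struct net_device *dev)
          {
                  dev->priv_destructor = example_priv_destructor;
                  dev->needs_free_netdev = true;  /* core calls free_netdev() */
          }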
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf124db5
  13. 07 Jun 2017, 3 commits
    • Revert "ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle" · f3b7eaae
      Authored by Rafael J. Wysocki
      Revert commit eed4d47e (ACPI / sleep: Ignore spurious SCI wakeups
      from suspend-to-idle) as it turned out to be premature and triggered
      a number of different issues on various systems.
      
      That includes, but is not limited to, premature suspend-to-RAM aborts
      on Dell XPS 13 (9343) reported by Dominik.
      
      The issue the commit in question attempted to address is real and
      will need to be taken care of going forward, but evidently more work
      is needed for this purpose.
      Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      f3b7eaae
    • compiler, clang: suppress warning for unused static inline functions · abb2ea7d
      Authored by David Rientjes
      GCC explicitly does not warn for unused static inline functions for
      -Wunused-function.  The manual states:
      
      	Warn whenever a static function is declared but not defined or
      	a non-inline static function is unused.
      
      Clang does warn for static inline functions that are unused.
      
      It turns out that suppressing the warnings avoids potentially complex
      #ifdef directives, which also reduces LOC.
      
      Suppress the warning for clang.
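
      The suppression is a one-line redefinition in the clang-specific
      compiler header, roughly (hedged sketch):

          /* include/linux/compiler-clang.h */
          #define inline inline __attribute__((unused))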
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abb2ea7d
    • elevator: fix truncation of icq_cache_name · 9bd2bbc0
      Authored by Eric Biggers
      gcc 7.1 reports the following warning:
      
          block/elevator.c: In function ‘elv_register’:
          block/elevator.c:898:5: warning: ‘snprintf’ output may be truncated before the last format character [-Wformat-truncation=]
               "%s_io_cq", e->elevator_name);
               ^~~~~~~~~~
          block/elevator.c:897:3: note: ‘snprintf’ output between 7 and 22 bytes into a destination of size 21
             snprintf(e->icq_cache_name, sizeof(e->icq_cache_name),
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               "%s_io_cq", e->elevator_name);
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      The bug is that the name of the icq_cache is 6 characters longer than
      the elevator name, but only ELV_NAME_MAX + 5 characters were reserved
      for it --- so in the case of a maximum-length elevator name, the 'q'
      character in "_io_cq" would be truncated by snprintf().  Fix it by
      reserving ELV_NAME_MAX + 6 characters instead.
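
      In sketch form (hedged, following the include/linux/elevator.h layout
      of that era):

          #define ELV_NAME_MAX    (16)

          struct elevator_type {
                  /* ... */
                  /* a 15-char name + "_io_cq" (6 chars) + NUL = 22 bytes */
                  char icq_cache_name[ELV_NAME_MAX + 6];
                  /* ... */
          };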
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9bd2bbc0
  14. 05 Jun 2017, 1 commit
  15. 03 Jun 2017, 4 commits
    • mm: consider memblock reservations for deferred memory initialization sizing · 864b9a39
      Authored by Michal Hocko
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      The reason is that the managed memory is too low (only 110MB) while the
      rest of the 50GB is still waiting for the deferred initialization to
      be done.  update_defer_init() estimates the initial memory to initialize
      to be at least 2GB, but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is then added on top
      of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes, so a reservation might
      be above the initial estimation, which we ignore; but let's keep the
      logic simple until we really need to handle more complicated cases.
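
      A hedged sketch of the helper (close to the patch in spirit; the exact
      overlap handling in the real code may differ):

          phys_addr_t __init memblock_reserved_memory_within(phys_addr_t start_addr,
                                                             phys_addr_t end_addr)
          {
                  struct memblock_region *r;
                  phys_addr_t size = 0;

                  /* sum reserved blocks that begin inside the window */
                  for_each_memblock(reserved, r) {
                          if (r->base >= start_addr && r->base < end_addr)
                                  size += r->size;
                  }
                  return size;
          }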
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      864b9a39
    • mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified · 9a291a7c
      Authored by James Morse
      KVM uses get_user_pages() to resolve its stage2 faults.  KVM sets the
      FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
      finds a VM_FAULT_HWPOISON.  KVM handles these hwpoison pages as a
      special case.  (check_user_page_hwpoison())
      
      When huge pages are involved, this doesn't work so well.
      get_user_pages() calls follow_hugetlb_page(), which stops early if it
      receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
      -EFAULT to the caller.  The step to map this to -EHWPOISON based on the
      FOLL_ flags is missing.  The hwpoison special case is skipped, and
      -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.
      
      Instead, move this VM_FAULT_-to-errno mapping code into a header file
      and use it from faultin_page() and follow_hugetlb_page().
      
      With this, KVM works as expected.
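
      The mapping helper has roughly this shape (hedged sketch of the
      header-file version):

          static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
          {
                  if (vm_fault & VM_FAULT_OOM)
                          return -ENOMEM;
                  if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
                          return (foll_flags & FOLL_HWPOISON) ?
                                  -EHWPOISON : -EFAULT;
                  if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
                          return -EFAULT;
                  return 0;
          }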
      
      This isn't a problem for arm64 today as we haven't enabled
      MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
      too, so I think this should be a fix.  This doesn't apply earlier than
      stable's v4.11.1 due to all sorts of cleanup.
      
      [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page(), as suggested]
        Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[4.11.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a291a7c
    • frv: declare jiffies to be located in the .data section · 60b0a8c3
      Authored by Matthias Kaehlcke
      Commit 7c30f352 ("jiffies.h: declare jiffies and jiffies_64 with
      ____cacheline_aligned_in_smp") removed a section specification from the
      jiffies declaration that caused conflicts on some platforms.
      
      Unfortunately this change broke the build for frv:
      
        kernel/built-in.o: In function `__do_softirq': (.text+0x6460): relocation truncated to fit: R_FRV_GPREL12 against symbol
            `jiffies' defined in *ABS* section in .tmp_vmlinux1
        kernel/built-in.o: In function `__do_softirq': (.text+0x6574): relocation truncated to fit: R_FRV_GPREL12 against symbol
            `jiffies' defined in *ABS* section in .tmp_vmlinux1
        kernel/built-in.o: In function `pwq_activate_delayed_work': workqueue.c:(.text+0x15b9c): relocation truncated to fit: R_FRV_GPREL12 against
            symbol `jiffies' defined in *ABS* section in .tmp_vmlinux1
        ...
      
      Add __jiffy_arch_data to the declaration of jiffies and use it on frv to
      include the section specification.  For all other platforms
      __jiffy_arch_data (currently) has no effect.
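
      In sketch form (hedged; the frv side is illustrative of the actual
      arch header):

          /* include/linux/jiffies.h: no section attribute by default */
          #ifndef __jiffy_arch_data
          #define __jiffy_arch_data
          #endif

          extern unsigned long volatile __cacheline_aligned_in_smp
                  __jiffy_arch_data jiffies;

          /* frv pins jiffies into .data so gp-relative relocations fit */
          #define __jiffy_arch_data __attribute__((__section__(".data")))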
      
      Fixes: 7c30f352 ("jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp")
      Link: http://lkml.kernel.org/r/20170516221333.177280-1-mka@chromium.org
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: David Howells <dhowells@redhat.com>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60b0a8c3
    • include/linux/gfp.h: fix ___GFP_NOLOCKDEP value · 1bde33e0
      Authored by Michal Hocko
      Igor Stoppa has noticed that __GFP_NOLOCKDEP can use a lower bit.  At
      the time commit 7e784422 ("lockdep: allow to disable reclaim lockup
      detection") was written we still had __GFP_OTHER_NODE but I have removed
      it in commit 41b6167e ("mm: get rid of __GFP_OTHER_NODE") and forgot
      to lower the bit value.
      
      The current value is outside of __GFP_BITS_SHIFT, so it cannot
      actually be used.
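
      A hedged sketch of the fix (bit values as in the 4.12-era gfp.h):

          /* was 0x4000000u: bit 26, above __GFP_BITS_SHIFT and masked off */
          #define ___GFP_NOLOCKDEP        0x2000000u

          /* 25 plain __GFP bits, plus one more when lockdep is enabled */
          #define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))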
      
      Fixes: 7e784422 ("lockdep: allow to disable reclaim lockup detection")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Igor Stoppa <igor.stoppa@nokia.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1bde33e0
  16. 02 Jun 2017, 1 commit
  17. 30 May 2017, 1 commit
    • iommu/dma: Fix function declaration · 4b1c8898
      Authored by Arnd Bergmann
      Newly added code in the ipmmu-vmsa driver exposed a small mistake
      in a header file that can't be included by itself without CONFIG_IOMMU_DMA
      enabled:
      
      In file included from drivers/iommu/ipmmu-vmsa.c:13:0:
      include/linux/dma-iommu.h:105:94: error: 'struct device' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
      
      This adds a forward declaration for 'struct device', similar to how
      we treat the other struct types in this case.
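
      The fix is the standard forward-declaration pattern (hedged sketch of
      the header):

          /* include/linux/dma-iommu.h */
          struct device;          /* only used through a pointer here */

          void iommu_dma_get_resv_regions(struct device *dev,
                                          struct list_head *list);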
      
      Fixes: 3ae47292 ("iommu/ipmmu-vmsa: Add new IOMMU_DOMAIN_DMA ops")
      Fixes: 273df963 ("iommu/dma: Make PCI window reservation generic")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      4b1c8898
  18. 26 May 2017, 2 commits
  19. 25 May 2017, 1 commit