1. 19 May 2017 (1 commit)
    • printk: Use the main logbuf in NMI when logbuf_lock is available · 719f6a70
      Petr Mladek authored
      Commit 42a0bb3f ("printk/nmi: generic solution for safe
      printk in NMI") made printk() store messages into a temporary
      per-CPU buffer when called in NMI context.
      
      Because the buffer is per-CPU, its size is rather limited.
      It works well for NMI backtraces, but longer logs may also be
      printed in NMI context, for example lockdep warnings or
      ftrace_dump_on_oops output.
      
      The temporary buffer is used to avoid deadlocks caused by
      logbuf_lock. It is also needed to avoid races with the other
      temporary buffer that is used when PRINTK_SAFE_CONTEXT is entered.
      But the main buffer can be used in NMI when the lock is available
      and we did not interrupt PRINTK_SAFE_CONTEXT.
      
      The lock is checked using raw_spin_is_locked(). This might cause
      false negatives when the lock is taken on another CPU and this
      CPU is in the safe context for other reasons. Note that the safe
      context is also used to take the console semaphore and when
      calling console drivers. For this reason, the check is done in
      printk_nmi_enter(). This keeps the handling consistent for the
      entire NMI handler and avoids reshuffling of the messages.
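      
      A minimal sketch of the decision this implies in printk_nmi_enter(),
      assuming the per-CPU printk_context flag word used by the printk-safe
      code (mask names are approximations, not the verbatim patch):
      
          void printk_nmi_enter(void)
          {
                  /* Fall back to the per-CPU buffer if the main logbuf_lock
                   * is taken or if this NMI interrupted PRINTK_SAFE_CONTEXT;
                   * otherwise defer to the main logbuf via irq_work. */
                  if ((this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK) ||
                      raw_spin_is_locked(&logbuf_lock))
                          this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
                  else
                          this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);
          }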
      
      The patch also defines a special printk context that allows
      printk_deferred() to be used in NMI. Note that the messages cannot
      be flushed to the consoles from NMI because console drivers might
      use many other internal locks.
      
      The newly created vprintk_deferred() disables preemption only
      around the irq_work handling. It is needed there to keep the two
      per-CPU variables consistent. There is no reason to disable
      preemption around vprintk_emit().
      
      Finally, the patch puts back explicit serialization of the NMI
      backtraces from different CPUs. It had been removed by commit
      a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all CPUs")
      because the flushing of the temporary per-CPU buffers was already
      serialized at that time.
      
      Link: http://lkml.kernel.org/r/1493912763-24873-1-git-send-email-pmladek@suse.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: x86@kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      719f6a70
  2. 04 May 2017 (3 commits)
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Michal Hocko authored
      The GFP_NOFS context is currently used for the following five reasons:
      
       - to prevent deadlocks when a lock held by the allocation context
         would be needed during memory reclaim
      
       - to prevent stack overflows during reclaim because the allocation
         is performed from an already deep context
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make forward progress indirectly
      
       - just in case, because this would be safe from the fs POV
      
       - to silence lockdep false positives
      
      Unfortunately, overuse of this allocation context brings some problems
      to the MM: memory reclaim is much weaker (especially during heavy FS
      metadata workloads), and the OOM killer cannot be invoked because the
      MM layer doesn't have enough information about how much memory is
      freeable by the FS layer.
      
      In many cases it is far from clear why the weaker context is even used,
      so it may well be used unnecessarily.  We would like to get rid of such
      uses as much as possible.  One way to do that is to use the flag in
      scopes rather than in isolated cases.  Such a scope is declared only
      when really necessary, is tracked per task, and all allocation requests
      made from within the scope simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, it
      also helps code paths where the FS layer interacts with other layers
      (e.g.  crypto, security modules, MM, etc.) and there is no easy way to
      convey the allocation context between the layers.
      
      Introduce the memalloc_nofs_{save,restore} API to control the scope of
      the GFP_NOFS allocation context.  This basically copies the
      memalloc_noio_{save,restore} API we already have for the other
      restricted allocation context, GFP_NOIO.  The PF_MEMALLOC_NOFS flag
      already exists and is just an alias for PF_FSTRANS, which had been
      xfs-specific until recently.  There are no PF_FSTRANS users anymore,
      so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and implicitly drops
      __GFP_FS, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
      memalloc_noio_flags() is renamed to current_gfp_context() because it
      now cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO
      contexts.  Xfs code paths keep their semantics.  kmem_flags_convert()
      doesn't need to evaluate the flag anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
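      
      A minimal usage sketch of the scope API; the fs_do_transaction() helper
      is purely illustrative, only memalloc_nofs_save()/restore() come from
      this patch:
      
          unsigned int nofs_flags;
      
          /* Everything allocated between save and restore inherits GFP_NOFS,
           * including allocations done by other layers called from here. */
          nofs_flags = memalloc_nofs_save();
          error = fs_do_transaction(sb);      /* hypothetical FS helper */
          memalloc_nofs_restore(nofs_flags);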
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
    • lockdep: allow to disable reclaim lockup detection · 7e784422
      Michal Hocko authored
      The current implementation of the reclaim lockup detection can lead to
      false positives; those do happen, and they usually result in the code
      being tweaked to silence lockdep by using GFP_NOFS even though the
      context could use __GFP_FS just fine.
      
      See
      
        http://lkml.kernel.org/r/20160512080321.GA18496@dastard
      
      as an example.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.5.0-rc2+ #4 Tainted: G           O
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
      
        (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]
      
        {RECLAIM_FS-ON-R} state was registered at:
          mark_held_locks+0x79/0xa0
          lockdep_trace_alloc+0xb3/0x100
          kmem_cache_alloc+0x33/0x230
          kmem_zone_alloc+0x81/0x120 [xfs]
          xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
          __xfs_refcount_find_shared+0x75/0x580 [xfs]
          xfs_refcount_find_shared+0x84/0xb0 [xfs]
          xfs_getbmap+0x608/0x8c0 [xfs]
          xfs_vn_fiemap+0xab/0xc0 [xfs]
          do_vfs_ioctl+0x498/0x670
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
               CPU0
               ----
          lock(&xfs_nondir_ilock_class);
          <Interrupt>
            lock(&xfs_nondir_ilock_class);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/543:
      
        stack backtrace:
        CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
        Call Trace:
         lock_acquire+0xd8/0x1e0
         down_write_nested+0x5e/0xc0
         xfs_ilock+0x177/0x200 [xfs]
         xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
         xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
         evict+0xc5/0x190
         dispose_list+0x39/0x60
         prune_icache_sb+0x4b/0x60
         super_cache_scan+0x14f/0x1a0
         shrink_slab.part.63.constprop.79+0x1e9/0x4e0
         shrink_zone+0x15e/0x170
         kswapd+0x4f1/0xa80
         kthread+0xf2/0x110
         ret_from_fork+0x3f/0x70
      
      To quote Dave:
       "Ignoring whether reflink should be doing anything or not, that's a
        "xfs_refcountbt_init_cursor() gets called both outside and inside
        transactions" lockdep false positive case. The problem here is lockdep
        has seen this allocation from within a transaction, hence a GFP_NOFS
        allocation, and now it's seeing it in a GFP_KERNEL context. Also note
        that we have an active reference to this inode.
      
        So, because the reclaim annotations overload the interrupt level
        detections and it's seen the inode ilock been taken in reclaim
        ("interrupt") context, this triggers a reclaim context warning where
        it thinks it is unsafe to do this allocation in GFP_KERNEL context
        holding the inode ilock..."
      
      This sounds like a fundamental problem of the reclaim lockup detection.
      It is really impossible to annotate such a special use case, IMHO,
      unless the reclaim lockup detection is reworked completely.  Until then
      it is much better to provide an "I know what I am doing" flag and use
      it to mark the problematic places.  This would prevent abuse of the
      GFP_NOFS flag, which has a runtime effect even on configurations that
      have lockdep disabled.
      
      Introduce the __GFP_NOLOCKDEP flag, which tells the lockdep gfp
      tracking to skip the current allocation request.
      
      While we are at it, also make sure that the radix tree doesn't
      accidentally override tags stored in the upper part of the gfp_mask.
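      
      A minimal sketch of how a call site that knows its reclaim recursion is
      safe could opt out of the lockdep gfp tracking (the call site itself is
      illustrative, only the __GFP_NOLOCKDEP flag comes from this patch):
      
          /* lockdep ignores this request; recursion into the FS is known
           * to be safe at this particular call site */
          ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);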
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e784422
    • lockdep: teach lockdep about memalloc_noio_save · 6d7225f0
      Nikolay Borisov authored
      Patch series "scope GFP_NOFS api", v5.
      
      This patch (of 7):
      
      Commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") added the memalloc_noio_(save|restore)
      functions to enable people to modify the MM behavior by disabling I/O
      during memory allocation.
      
      This was further extended in commit 934f3072 ("mm: clear __GFP_FS
      when PF_MEMALLOC_NOIO is set").
      
      The memalloc_noio_* functions prevent allocation paths from recursing
      back into the filesystem without explicitly changing the flags at every
      allocation site.
      
      However, lockdep hasn't kept up with these changes and entirely misses
      the memalloc_noio adjustments.  Instead, it is left to the callers of
      __lockdep_trace_alloc to call the function only after they have
      stripped the respective GFP flags, which can lead to false positives:
      
        =================================
         [ INFO: inconsistent lock state ]
         4.10.0-nbor #134 Not tainted
         ---------------------------------
         inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
         fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
          (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
         {IN-RECLAIM_FS-W} state was registered at:
           __lock_acquire+0x62a/0x17c0
           lock_acquire+0xc5/0x220
           down_write_nested+0x4f/0x90
           xfs_ilock+0x141/0x230
           xfs_reclaim_inode+0x12a/0x320
           xfs_reclaim_inodes_ag+0x2c8/0x4e0
           xfs_reclaim_inodes_nr+0x33/0x40
           xfs_fs_free_cached_objects+0x19/0x20
           super_cache_scan+0x191/0x1a0
           shrink_slab+0x26f/0x5f0
           shrink_node+0xf9/0x2f0
           kswapd+0x356/0x920
           kthread+0x10c/0x140
           ret_from_fork+0x31/0x40
         irq event stamp: 173777
         hardirqs last  enabled at (173777): __local_bh_enable_ip+0x70/0xc0
         hardirqs last disabled at (173775): __local_bh_enable_ip+0x37/0xc0
         softirqs last  enabled at (173776): _xfs_buf_find+0x67a/0xb70
         softirqs last disabled at (173774): _xfs_buf_find+0x5db/0xb70
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&xfs_nondir_ilock_class);
           <Interrupt>
             lock(&xfs_nondir_ilock_class);
      
          *** DEADLOCK ***
      
         4 locks held by fsstress/3365:
          #0:  (sb_writers#10){++++++}, at: mnt_want_write+0x24/0x50
          #1:  (&sb->s_type->i_mutex_key#12){++++++}, at: vfs_setxattr+0x6f/0xb0
          #2:  (sb_internal#2){++++++}, at: xfs_trans_alloc+0xfc/0x140
          #3:  (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
      
         stack backtrace:
         CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
         Call Trace:
          kmem_cache_alloc_node_trace+0x3a/0x2c0
          vm_map_ram+0x2a1/0x510
          _xfs_buf_map_pages+0x77/0x140
          xfs_buf_get_map+0x185/0x2a0
          xfs_attr_rmtval_set+0x233/0x430
          xfs_attr_leaf_addname+0x2d2/0x500
          xfs_attr_set+0x214/0x420
          xfs_xattr_set+0x59/0xb0
          __vfs_setxattr+0x76/0xa0
          __vfs_setxattr_noperm+0x5e/0xf0
          vfs_setxattr+0xae/0xb0
          setxattr+0x15e/0x1a0
          path_setxattr+0x8f/0xc0
          SyS_lsetxattr+0x11/0x20
          entry_SYSCALL_64_fastpath+0x23/0xc6
      
      Let's fix this by making lockdep strip the respective GFP flags itself.
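      
      Conceptually, the masking lockdep now applies before judging a request
      looks like the sketch below (the helper name is hypothetical; the flag
      handling mirrors the existing PF_MEMALLOC_NOIO semantics):
      
          /* hypothetical helper: the gfp mask lockdep should reason about */
          static gfp_t lockdep_effective_gfp(gfp_t gfp_mask)
          {
                  /* honour the per-task NOIO scope before checking reclaim rules */
                  if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                          gfp_mask &= ~(__GFP_IO | __GFP_FS);
                  return gfp_mask;
          }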
      
      Fixes: 934f3072 ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
      Link: http://lkml.kernel.org/r/20170306131408.9828-2-mhocko@kernel.org
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d7225f0
  3. 02 May 2017 (13 commits)
  4. 01 May 2017 (2 commits)
    • bpf: enhance verifier to understand stack pointer arithmetic · 332270fd
      Yonghong Song authored
      llvm 4.0 and above generates code like the following:
      ....
      440: (b7) r1 = 15
      441: (05) goto pc+73
      515: (79) r6 = *(u64 *)(r10 -152)
      516: (bf) r7 = r10
      517: (07) r7 += -112
      518: (bf) r2 = r7
      519: (0f) r2 += r1
      520: (71) r1 = *(u8 *)(r8 +0)
      521: (73) *(u8 *)(r2 +45) = r1
      ....
      and the verifier complains "R2 invalid mem access 'inv'" for insn #521.
      This is because the verifier marks register r2 as an unknown value after
      insn #519, where r2 is a stack pointer and r1 holds a constant value.
      
      Teach the verifier to recognize "stack_ptr + imm" and
      "stack_ptr + reg holding a constant value" as a valid stack_ptr with a
      new offset.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      332270fd
    • mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703
      Dan Williams authored
      The x86 conversion to the generic GUP code included a small change which causes
      crashes and data corruption in the pmem code - not good.
      
      The root cause is that the /dev/pmem driver code implicitly relies on the x86
      get_user_pages() implementation doing a get_page() on the page refcount, because
      get_page() does a get_zone_device_page() which properly refcounts pmem's separate
      page struct arrays that are not present in the regular page struct structures.
      (The pmem driver does this because it can cover huge memory areas.)
      
      But the x86 conversion to the generic GUP code changed the get_page() to
      page_cache_get_speculative() which is faster but doesn't do the
      get_zone_device_page() call the pmem code relies on.
      
      One way to solve the regression would be to change the generic GUP code to use
      get_page(), but that would slow things down a bit and punish other generic-GUP
      using architectures for an x86-ism they did not care about. (Arguably the pmem
      driver was probably not working reliably for them: but nvdimm is an Intel
      feature, so non-x86 exposure is probably still limited.)
      
      So restructure the pmem code's interface with the MM instead: get rid of
      the get/put_zone_device_page() distinction, integrate
      put_zone_device_page() into __put_page(), and restructure the pmem
      completion-wait and teardown machinery:
      
      Kirill points out that the calls to {get,put}_dev_pagemap() can be
      removed from the mm fast path if we take a single get_dev_pagemap()
      reference to signify that the page is alive and use the final put of the
      page to drop that reference.
      
      This does require some care to make sure that any waits for the
      percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
      since it now maintains its own elevated reference.
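      
      A rough sketch of what folding the zone-device put into the final page
      put looks like (simplified, not the exact hunk from this patch):
      
          void __put_page(struct page *page)
          {
                  if (is_zone_device_page(page)) {
                          /* Final put: drop the pagemap reference taken when
                           * the range was mapped; the page is never handed
                           * back to the page allocator. */
                          put_dev_pagemap(page->pgmap);
                          return;
                  }
      
                  if (unlikely(PageCompound(page)))
                          __put_compound_page(page);
                  else
                          __put_single_page(page);
          }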
      
      This speeds things up while also making the pmem refcounting more robust
      going forward.
      Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71389703
  5. 29 April 2017 (3 commits)
    • cgroup: avoid attaching a cgroup root to two different superblocks, take 2 · 9732adc5
      Zefan Li authored
      Commit bfb0b80d ("cgroup: avoid attaching a cgroup root to two
      different superblocks") is broken.  This time, fix the race by delaying
      the initialization of the cgroup root refcnt until a superblock has
      been allocated.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Andrei Vagin <avagin@virtuozzo.com>
      Tested-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      9732adc5
    • bpf: bpf_lock on kallsysms doesn't need to be irqsave · d24f7c7f
      Hannes Frederic Sowa authored
      Hannes rightfully spotted that the bpf_lock doesn't need to be the
      irqsave variant.  We never perform any such updates where this would be
      necessary (neither right now nor in the future), therefore relax this
      further.
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d24f7c7f
    • cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() · a590b90d
      Tejun Heo authored
      cgroup_get() is expected to be called only on live cgroups and triggers
      a warning on a dead cgroup; however, cgroup_sk_alloc() may be called
      while cloning a socket that is left in an empty and removed cgroup and
      thus may legitimately duplicate its reference on a dead cgroup.  This
      currently triggers the following warning spuriously.
      
       WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
       ...
        [<ffffffff8107e123>] __warn+0xd3/0xf0
        [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
        [<ffffffff810ff465>] cgroup_get+0x55/0x60
        [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
        [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
        [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
        [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
        [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
        [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
        [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
        [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
        [<ffffffff81837cb2>] ip6_input+0x32/0xb0
        [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
        [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
        [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
        [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
        [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
        [<ffffffff817787d8>] napi_gro_frags+0x208/0x270
        [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
        [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
        [<ffffffff81778b30>] net_rx_action+0x210/0x350
        [<ffffffff8188c426>] __do_softirq+0x106/0x2c7
         [<ffffffff81082bad>] irq_exit+0x9d/0xa0
         [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
        [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
        [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
        [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
        [<ffffffff8103edd1>] start_secondary+0xf1/0x100
      
      This patch renames the existing cgroup_get() with the dead cgroup
      warning to cgroup_get_live() after cgroup_kn_lock_live() and
      introduces the new cgroup_get() which doesn't check whether the cgroup
      is live or dead.
      
      All existing cgroup_get() users except for cgroup_sk_alloc() are
      converted to use cgroup_get_live().
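      
      A minimal sketch of the resulting split (simplified from kernel/cgroup.c,
      not the verbatim patch):
      
          /* safe to call on a dead cgroup, e.g. from cgroup_sk_alloc() */
          static void cgroup_get(struct cgroup *cgrp)
          {
                  css_get(&cgrp->self);
          }
      
          /* keeps the old behaviour: warn if the cgroup is already dead */
          static void cgroup_get_live(struct cgroup *cgrp)
          {
                  WARN_ON_ONCE(cgroup_is_dead(cgrp));
                  css_get(&cgrp->self);
          }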
      
      Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      Cc: stable@vger.kernel.org # v4.5+
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Chris Mason <clm@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a590b90d
  6. 27 April 2017 (1 commit)
  7. 25 April 2017 (2 commits)
  8. 22 April 2017 (2 commits)
  9. 21 April 2017 (3 commits)
  10. 20 April 2017 (4 commits)
  11. 19 April 2017 (1 commit)
    • sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL · 395102db
      Daniel Jordan authored
      CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
      kernel text, data, and bss fit in the required 32MB limit, but this
      option is not set for every config that enables lockdep.
      
      A 4.10 kernel fails to boot with the console output
      
          Kernel: Using 8 locked TLB entries for main kernel image.
          hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
          Program terminated
      
      with these config options
      
          CONFIG_LOCKDEP=y
          CONFIG_LOCK_STAT=y
          CONFIG_PROVE_LOCKING=n
      
      To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
      enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
      usage every time lockdep is turned on.
      
      Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
      CONFIG_LOCKDEP is set to 'y'.  When other lockdep-related config options
      that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
      CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
      enabled.
      
      Fixes: e6b5f1be ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      395102db
  12. 18 April 2017 (5 commits)
    • boot/param: Move next_arg() function to lib/cmdline.c for later reuse · f51b17c8
      Baoquan He authored
      next_arg() will be used to parse boot parameters in the x86/boot/compressed code,
      so move it to lib/cmdline.c for better code reuse.
      
      No change in functionality.
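      
      A small usage sketch of the moved helper, assuming the existing
      next_arg() signature (the example command line is made up):
      
          char buf[] = "console=ttyS0 quiet root=/dev/sda1";
          char *args = buf, *param, *val;
      
          while (*args) {
                  args = next_arg(args, &param, &val);
                  /* first pass: param -> "console", val -> "ttyS0";
                   * for "quiet", val is NULL */
          }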
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Larry Finger <Larry.Finger@lwfinger.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dan.j.williams@intel.com
      Cc: dave.jiang@intel.com
      Cc: dyoung@redhat.com
      Cc: keescook@chromium.org
      Cc: zijun_hu <zijun_hu@htc.com>
      Link: http://lkml.kernel.org/r/1492436099-4017-2-git-send-email-bhe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f51b17c8
    • ftrace: Fix function pid filter on instances · d879d0b8
      Namhyung Kim authored
      When the function tracer has a pid filter, it adds a probe to
      sched_switch to track whether the current task can be ignored.  The
      probe checks ftrace_ignore_pid from the current tr to filter tasks.
      But it fails to delete the probe when an instance is removed, which can
      cause a crash due to an invalid tr pointer (use-after-free).
      This is easily reproducible with the following:
      
        # cd /sys/kernel/debug/tracing
        # mkdir instances/buggy
        # echo $$ > instances/buggy/set_ftrace_pid
        # rmdir instances/buggy
      
        ============================================================================
        BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90
        Read of size 8 by task kworker/0:1/17
        CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G    B           4.11.0-rc3  #198
        Call Trace:
         dump_stack+0x68/0x9f
         kasan_object_err+0x21/0x70
         kasan_report.part.1+0x22b/0x500
         ? ftrace_filter_pid_sched_switch_probe+0x3d/0x90
         kasan_report+0x25/0x30
         __asan_load8+0x5e/0x70
         ftrace_filter_pid_sched_switch_probe+0x3d/0x90
         ? fpid_start+0x130/0x130
         __schedule+0x571/0xce0
         ...
      
      To fix it, use ftrace_clear_pids() to unregister the probe.  As
      instance_rmdir() has already updated the ftrace code, it can just free
      the filter safely.
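      
      Roughly where the new teardown call slots in (heavily abbreviated, not
      the verbatim instance_rmdir(); only ftrace_clear_pids() is the point):
      
          static int instance_rmdir(const char *name)
          {
                  struct trace_array *tr;
      
                  /* ... look up and validate the instance ... */
                  ftrace_clear_pids(tr);   /* unregister the sched_switch probe
                                              and free the pid filter */
                  /* ... tear down function files, remove the tracefs dir,
                   *     free the trace buffers ... */
                  return 0;
          }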
      
      Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org
      
      Fixes: 0c8916c3 ("tracing: Add rmdir to remove multibuffer instances")
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      d879d0b8
    • bpf: fix checking xdp_adjust_head on tail calls · c2002f98
      Daniel Borkmann authored
      Commit 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      added the xdp_adjust_head bit to the BPF prog in order to tell drivers
      that the program that is to be attached requires support for the XDP
      bpf_xdp_adjust_head() helper such that drivers not supporting this
      helper can reject the program. There are also drivers that do support
      the helper, but need to check for xdp_adjust_head bit in order to move
      packet metadata prepended by the firmware away for making headroom.
      
      For these cases, the current check of the xdp_adjust_head bit is
      insufficient: there can be cases where the program itself does not use
      the bpf_xdp_adjust_head() helper, but tail calls into another program
      that does use bpf_xdp_adjust_head().  In that case, the xdp_adjust_head
      bit is still set to 0.  Since the first program has no control over
      which program it calls into, we need to assume that the
      bpf_xdp_adjust_head() helper may be used via tail calls.  Thus, for the
      very same reasons as in cb_access, set the xdp_adjust_head bit to 1
      when the main program uses tail calls.
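      
      Conceptually the fix boils down to setting the bit conservatively while
      the program is being verified (sketch only; the real check sits where
      the verifier records the use of a prog-array map):
      
          /* a tail call may land in a program that adjusts the head, so be
           * conservative and advertise the capability requirement */
          if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
                  prog->xdp_adjust_head = 1;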
      
      Fixes: 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2002f98
    • bpf: fix cb access in socket filter programs on tail calls · 6b1bb01b
      Daniel Borkmann authored
      Commit ff936a04 ("bpf: fix cb access in socket filter programs")
      added a fix for socket filter programs such that in i) AF_PACKET the
      20 bytes of skb->cb[] area gets zeroed before use in order to not leak
      data, and ii) socket filter programs attached to TCP/UDP sockets need
      to save/restore these 20 bytes since they are also used by protocol
      layers at that time.
      
      The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
      only look at the actual attached program to determine whether to zero
      or save/restore the skb->cb[] parts. There can be cases where the
      actual attached program does not access the skb->cb[], but the program
      tail calls into another program which does access this area. In such
      a case, the zero or save/restore is currently not performed.
      
      Since the programs we tail call into are unknown at verification time
      and can change dynamically, we need to assume that whenever the attached
      program performs a tail call, later programs could access the skb->cb[],
      and therefore we always need to set cb_access to 1.
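      
      For reference, a simplified sketch of how the runtime keys off cb_access
      for the save/restore case (close in spirit to the include/linux/filter.h
      helper, but not the verbatim code):
      
          static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog,
                                                 struct sk_buff *skb)
          {
                  u8 *cb_data = bpf_skb_cb(skb);
                  u8 cb_saved[BPF_SKB_CB_LEN];
                  u32 res;
      
                  if (unlikely(prog->cb_access))
                          memcpy(cb_saved, cb_data, sizeof(cb_saved));
      
                  res = BPF_PROG_RUN(prog, skb);
      
                  if (unlikely(prog->cb_access))
                          memcpy(cb_data, cb_saved, sizeof(cb_saved));
      
                  return res;
          }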
      
      Fixes: ff936a04 ("bpf: fix cb access in socket filter programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b1bb01b
    • bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4 · 695ba265
      Martin KaFai Lau authored
      After doing map_perf_test with a much bigger
      BPF_F_NO_COMMON_LRU map, the perf report shows a
      lot of time spent in rotating the inactive list (i.e.
      __bpf_lru_list_rotate_inactive):
      > map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
      19644783 (19M/s)
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      6283930 (6.28M/s)
      
      Being on the inactive list usually means the element is not in the
      cache.  Hence, there is a need to tune the PERCPU_NR_SCANS value.
      
      This patch finds a better number of elements to
      scan during each list rotation.  The PERCPU_NR_SCANS (which
      is defined the same as PERCPU_FREE_TARGET) decreases
      from 16 elements to 4 elements.  This change only
      affects the BPF_F_NO_COMMON_LRU map.
      
      The test_lru_dist benchmark does not show a meaningful difference
      between 16 and 4.  Our production L4 load balancer, which uses the LRU
      map for conntrack-ing, also shows little change in cache hit rate.
      Since both benchmark and production data show no cache-hit difference,
      PERCPU_NR_SCANS is lowered from 16 to 4.  We can consider making it
      configurable if we later find a use case where another value works
      better, and/or use a different rotation strategy.
      
      After this change:
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      9240324 (9.2M/s)
      
      i.e. 6.28M/s -> 9.2M/s
      
      test_lru_dist has not shown a meaningful difference:
      > test_lru_dist zipf.100k.a1_01.out 4000 1:
      nr_misses: 31575 (Before) vs 31566 (After)
      
      > test_lru_dist zipf.100k.a0_01.out 40000 1
      nr_misses: 67036 (Before) vs 67031 (After)
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      695ba265