1. 06 10月, 2016 1 次提交
    • J
      mm: filemap: don't plant shadow entries without radix tree node · d3798ae8
      Johannes Weiner 提交于
      When the underflow checks were added to workingset_node_shadow_dec(),
      they triggered immediately:
      
        kernel BUG at ./include/linux/swap.h:276!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
         soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
        CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60b #1
        Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
        task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
        RIP: page_cache_tree_insert+0xf1/0x100
        Call Trace:
          __add_to_page_cache_locked+0x12e/0x270
          add_to_page_cache_lru+0x4e/0xe0
          mpage_readpages+0x112/0x1d0
          blkdev_readpages+0x1d/0x20
          __do_page_cache_readahead+0x1ad/0x290
          force_page_cache_readahead+0xaa/0x100
          page_cache_sync_readahead+0x3f/0x50
          generic_file_read_iter+0x5af/0x740
          blkdev_read_iter+0x35/0x40
          __vfs_read+0xe1/0x130
          vfs_read+0x96/0x130
          SyS_read+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x13/0x8f
        Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
        RIP  page_cache_tree_insert+0xf1/0x100
      
      This is a long-standing bug in the way shadow entries are accounted in
      the radix tree nodes. The shrinker needs to know when radix tree nodes
      contain only shadow entries, no pages, so node->count is split in half
      to count shadows in the upper bits and pages in the lower bits.
      
      Unfortunately, the radix tree implementation doesn't know of this and
      assumes all entries are in node->count. When there is a shadow entry
      directly in root->rnode and the tree is later extended, the radix tree
      implementation will copy that entry into the new node and and bump its
      node->count, i.e. increases the page count bits. Once the shadow gets
      removed and we subtract from the upper counter, node->count underflows
      and triggers the warning. Afterwards, without node->count reaching 0
      again, the radix tree node is leaked.
      
      Limit shadow entries to when we have actual radix tree nodes and can
      count them properly. That means we lose the ability to detect refaults
      from files that had only the first page faulted in at eviction time.
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-and-tested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3798ae8
  2. 04 10月, 2016 8 次提交
  3. 01 10月, 2016 2 次提交
    • J
      include/linux/property.h: fix typo/compile error · 37aa7271
      John Youn 提交于
      This fixes commit d76eebfa ("include/linux/property.h: fix build
      issues with gcc-4.4.4").
      
      With that commit we get the following compile error when using the
      PROPERTY_ENTRY_INTEGER_ARRAY macro.
      
       include/linux/property.h:201:39: error: `u32_data' undeclared (first
                       use in this function)
        PROPERTY_ENTRY_INTEGER_ARRAY(_name_, u32, _val_)
                                             ^
       include/linux/property.h:193:17: note: in definition of macro
                       `PROPERTY_ENTRY_INTEGER_ARRAY'
        { .pointer = { _type_##_data = _val_ } },  \
                       ^
      
      This needs a '.' to reference the union member.  It seems this was just
      overlooked here since it is done correctly in similar constructs in
      other parts of the original commit.
      
      This fix is in preparation of upcoming commits that will use this macro.
      
      Fixes: commit d76eebfa ("include/linux/property.h: fix build issues with gcc-4.4.4")
      Link: http://lkml.kernel.org/r/2de3b929290d88a723ed829a3e3cbd02044714df.1475114627.git.johnyoun@synopsys.comSigned-off-by: NJohn Youn <johnyoun@synopsys.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37aa7271
    • J
      mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page() · 22f2ac51
      Johannes Weiner 提交于
      Antonio reports the following crash when using fuse under memory pressure:
      
        kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: all of them
        CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
        Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
        task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
        RIP: shadow_lru_isolate+0x181/0x190
        Call Trace:
          __list_lru_walk_one.isra.3+0x8f/0x130
          list_lru_walk_one+0x23/0x30
          scan_shadow_nodes+0x34/0x50
          shrink_slab.part.40+0x1ed/0x3d0
          shrink_zone+0x2ca/0x2e0
          kswapd+0x51e/0x990
          kthread+0xd8/0xf0
          ret_from_fork+0x3f/0x70
      
      which corresponds to the following sanity check in the shadow node
      tracking:
      
        BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
      
      The workingset code tracks radix tree nodes that exclusively contain
      shadow entries of evicted pages in them, and this (somewhat obscure)
      line checks whether there are real pages left that would interfere with
      reclaim of the radix tree node under memory pressure.
      
      While discussing ways how fuse might sneak pages into the radix tree
      past the workingset code, Miklos pointed to replace_page_cache_page(),
      and indeed there is a problem there: it properly accounts for the old
      page being removed - __delete_from_page_cache() does that - but then
      does a raw raw radix_tree_insert(), not accounting for the replacement
      page.  Eventually the page count bits in node->count underflow while
      leaving the node incorrectly linked to the shadow node LRU.
      
      To address this, make sure replace_page_cache_page() uses the tracked
      page insertion code, page_cache_tree_insert().  This fixes the page
      accounting and makes sure page-containing nodes are properly unlinked
      from the shadow node LRU again.
      
      Also, make the sanity checks a bit less obscure by using the helpers for
      checking the number of pages and shadows in a radix tree node.
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NAntonio SJ Musumeci <trapexit@spawn.link>
      Debugged-by: NMiklos Szeredi <miklos@szeredi.hu>
      Cc: <stable@vger.kernel.org>	[3.15+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22f2ac51
  4. 30 9月, 2016 10 次提交
    • F
      u64_stats: Introduce IRQs disabled helpers · 68107df5
      Frederic Weisbecker 提交于
      Introduce light versions of u64_stats helpers for context where
      either preempt or IRQs are disabled. This way we can make this library
      usable by scheduler irqtime accounting which currenty implement its
      ad-hoc version.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Link: http://lkml.kernel.org/r/1474849761-12678-4-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      68107df5
    • P
      sched/core, ia64: Rename set_curr_task() · a458ae2e
      Peter Zijlstra 提交于
      Rename the ia64 only set_curr_task() function to free up the name.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      a458ae2e
    • P
      sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Peter Zijlstra 提交于
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough and sometimes just
      does plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration; with
      the intent of finding an idle core (on SMT hardware). The problems
      which this patch tries to address are:
      
       - its pointless to look for idle cores if the machine is real busy;
         at which point you're just wasting cycles.
      
       - it's behaviour is inconsistent between SMT and !SMT hardware in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guestimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages. Esp. hackbench was sensitive to this.
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in cpu id order. This avoids multiple
      CPUs in the LLC scanning for idle to gang up and find the same CPU
      quite as much. The down side is that tasks can end up hopping across
      the LLC for no apparent reason.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      10e2f1ac
    • P
      sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Peter Zijlstra 提交于
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0e369d75
    • P
      sched/core: Introduce 'struct sched_domain_shared' · 24fc7edb
      Peter Zijlstra 提交于
      Since struct sched_domain is strictly per cpu; introduce a structure
      that is shared between all 'identical' sched_domains.
      
      Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
      for shared cache state; if another use comes up later we can easily
      relax this.
      
      While the sched_group's are normally shared between CPUs, these are
      not natural to use when we need some shared state on a domain level --
      since that would require the domain to have a parent, which is not a
      given.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      24fc7edb
    • O
      sched/wait: Introduce init_wait_entry() · 0176beaf
      Oleg Nesterov 提交于
      The partial initialization of wait_queue_t in prepare_to_wait_event() looks
      ugly. This was done to shrink .text, but we can simply add the new helper
      which does the full initialization and shrink the compiled code a bit more.
      
      And. This way prepare_to_wait_event() can have more users. In particular we
      are ready to remove the signal_pending_state() checks from wait_bit_action_f
      helpers and change __wait_on_bit_lock() to use prepare_to_wait_event().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140055.GA6167@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0176beaf
    • O
      sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock() · eaf9ef52
      Oleg Nesterov 提交于
      __wait_on_bit_lock() doesn't need abort_exclusive_wait() too. Right
      now it can't use prepare_to_wait_event() (see the next change), but
      it can do the additional finish_wait() if action() fails.
      
      abort_exclusive_wait() no longer has callers, remove it.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140053.GA6164@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      eaf9ef52
    • O
      sched/wait: Avoid abort_exclusive_wait() in ___wait_event() · b1ea06a9
      Oleg Nesterov 提交于
      ___wait_event() doesn't really need abort_exclusive_wait(), we can simply
      change prepare_to_wait_event() to remove the waiter from q->task_list if
      it was interrupted.
      
      This simplifies the code/logic, and this way prepare_to_wait_event() can
      have more users, see the next change.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160908164815.GA18801@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      --
       include/linux/wait.h |    7 +------
       kernel/sched/wait.c  |   35 +++++++++++++++++++++++++----------
       2 files changed, 26 insertions(+), 16 deletions(-)
      b1ea06a9
    • O
      sched/wait: Fix abort_exclusive_wait(), it should pass TASK_NORMAL to wake_up() · 38a3e1fc
      Oleg Nesterov 提交于
      Otherwise this logic only works if mode is "compatible" with another
      exclusive waiter.
      
      If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
      abort_exclusive_wait() won't wait an uninterruptible waiter.
      
      The main user is __wait_on_bit_lock() and currently it is fine but only
      because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
      lock_page_interruptible() yet.
      
      Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
      Yes, this means that (say) wake_up_interruptible() can wake up the non-
      interruptible waiter(s), but I think this is fine. And in fact I think
      that abort_exclusive_wait() must die, see the next change.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160906140047.GA6157@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      38a3e1fc
    • M
      ipv6 addrconf: implement RFC7559 router solicitation backoff · bd11f074
      Maciej Żenczykowski 提交于
      This implements:
        https://tools.ietf.org/html/rfc7559
      
      Backoff is performed according to RFC3315 section 14:
        https://tools.ietf.org/html/rfc3315#section-14
      
      We allow setting /proc/sys/net/ipv6/conf/*/router_solicitations
      to a negative value meaning an unlimited number of retransmits,
      and we make this the new default (inline with the RFC).
      
      We also add a new setting:
        /proc/sys/net/ipv6/conf/*/router_solicitation_max_interval
      defaulting to 1 hour (per RFC recommendation).
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      Acked-by: NErik Kline <ek@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd11f074
  5. 29 9月, 2016 2 次提交
    • J
      bpf: allow access into map value arrays · 48461135
      Josef Bacik 提交于
      Suppose you have a map array value that is something like this
      
      struct foo {
      	unsigned iter;
      	int array[SOME_CONSTANT];
      };
      
      You can easily insert this into an array, but you cannot modify the contents of
      foo->array[] after the fact.  This is because we have no way to verify we won't
      go off the end of the array at verification time.  This patch provides a start
      for this work.  We accomplish this by keeping track of a minimum and maximum
      value a register could be while we're checking the code.  Then at the time we
      try to do an access into a MAP_VALUE we verify that the maximum offset into that
      region is a valid access into that memory region.  So in practice, code such as
      this
      
      unsigned index = 0;
      
      if (foo->iter >= SOME_CONSTANT)
      	foo->iter = index;
      else
      	index = foo->iter++;
      foo->array[index] = bar;
      
      would be allowed, as we can verify that index will always be between 0 and
      SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an extra
      check to make sure the index isn't less than 0, or do something like index %=
      SOME_CONSTANT.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48461135
    • A
      dma-mapping.h: preserve unmap info for CONFIG_DMA_API_DEBUG · 2481366a
      Andrey Smirnov 提交于
      When CONFIG_DMA_API_DEBUG is enabled we need to preserve unmapping address
      even if "unmap" is a no-op for our architecutre because we need
      debug_dma_unmap_page() to correctly cleanup all of the debug bookkeeping.
      Failing to do so results in a false positive warnings about previously
      mapped areas never being unmapped.
      
      Link: http://lkml.kernel.org/r/1474387125-3713-1-git-send-email-andrew.smirnov@gmail.comSigned-off-by: NAndrey Smirnov <andrew.smirnov@gmail.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Zhen Lei <thunder.leizhen@huawei.com>
      Cc: "Luis R. Rodriguez" <mcgrof@suse.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2481366a
  6. 28 9月, 2016 2 次提交
  7. 27 9月, 2016 7 次提交
  8. 26 9月, 2016 2 次提交
    • N
      ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route · 2cf75070
      Nikolay Aleksandrov 提交于
      Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid
      instead of the previous dst_pid which was copied from in_skb's portid.
      Since the skb is new the portid is 0 at that point so the packets are sent
      to the kernel and we get scheduling while atomic or a deadlock (depending
      on where it happens) by trying to acquire rtnl two times.
      Also since this is RTM_GETROUTE, it can be triggered by a normal user.
      
      Here's the sleeping while atomic trace:
      [ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
      [ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
      [ 7858.212881] 2 locks held by swapper/0/0:
      [ 7858.213013]  #0:  (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350
      [ 7858.213422]  #1:  (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130
      [ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
      [ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 7858.214108]  0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
      [ 7858.214412]  ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
      [ 7858.214716]  000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
      [ 7858.215251] Call Trace:
      [ 7858.215412]  <IRQ>  [<ffffffff813a7804>] dump_stack+0x85/0xc1
      [ 7858.215662]  [<ffffffff810a4a72>] ___might_sleep+0x192/0x250
      [ 7858.215868]  [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100
      [ 7858.216072]  [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0
      [ 7858.216279]  [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460
      [ 7858.216487]  [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40
      [ 7858.216687]  [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260
      [ 7858.216900]  [<ffffffff81573c70>] rtnl_unicast+0x20/0x30
      [ 7858.217128]  [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0
      [ 7858.217351]  [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130
      [ 7858.217581]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.217785]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.217990]  [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350
      [ 7858.218192]  [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350
      [ 7858.218415]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.218656]  [<ffffffff810fde10>] run_timer_softirq+0x260/0x640
      [ 7858.218865]  [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f
      [ 7858.219068]  [<ffffffff816637c8>] __do_softirq+0xe8/0x54f
      [ 7858.219269]  [<ffffffff8107a948>] irq_exit+0xb8/0xc0
      [ 7858.219463]  [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50
      [ 7858.219678]  [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0
      [ 7858.219897]  <EOI>  [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10
      [ 7858.220165]  [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10
      [ 7858.220373]  [<ffffffff810298e3>] default_idle+0x23/0x190
      [ 7858.220574]  [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20
      [ 7858.220790]  [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60
      [ 7858.221016]  [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0
      [ 7858.221257]  [<ffffffff8164f995>] rest_init+0x135/0x140
      [ 7858.221469]  [<ffffffff81f83014>] start_kernel+0x50e/0x51b
      [ 7858.221670]  [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120
      [ 7858.221894]  [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c
      [ 7858.222113]  [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a
      
      Fixes: 2942e900 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2cf75070
    • D
      fault_in_multipages_readable() throws set-but-unused error · 90b75db6
      Dave Chinner 提交于
      When building XFS with -Werror, it now fails with:
      
        include/linux/pagemap.h: In function 'fault_in_multipages_readable':
        include/linux/pagemap.h:602:16: error: variable 'c' set but not used [-Werror=unused-but-set-variable]
          volatile char c;
                        ^
      
      This is a regression caused by commit e23d4159 ("fix
      fault_in_multipages_...() on architectures with no-op access_ok()").
      Fix it by re-adding the "(void)c" trick taht was previously used to make
      the compiler think the variable is used.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90b75db6
  9. 25 9月, 2016 2 次提交
  10. 24 9月, 2016 4 次提交
    • M
      net/mlx4: Add VF vlan protocol 802.1ad support · b42959dc
      Moshe Shemesh 提交于
      Move the vf to VST 802.1ad mode (mlx4 VST QinQ mode) by setting vf vlan
      protocol to 802.1ad.
      VST 802.1ad mode in mlx4, is used for STAG strip/insertion by PF, while
      the CTAG is set by the VF.
      Read current vlan protocol as part of the vf configuration state.
      
      Upon setting vf vlan protocol to 802.1ad, we use a mechanism of handshake
      to verify that both the vf and the pf driver version support it.
      The handshake uses the command QUERY_FUNC_CAP:
      - The vf sets a pre-defined support bit in input modifier.
      - A pf that supports the feature sends the request to the vf through a
        pre-defined field in the output mailbox.
      - In case vf does not support the feature, the pf will fail the control
        command (in this case, IP link tool command to set the vf vlan
        protocol to 802.1ad).
      
      No change in VST 802.1Q mode.
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b42959dc
    • M
      net: Update API for VF vlan protocol 802.1ad support · 79aab093
      Moshe Shemesh 提交于
      Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
      the ability for user-space application to specify it for the VF, as an
      option to support 802.1ad.
      We adjusted IP Link tool to support this option.
      
      For future use cases, the new UAPI supports multiple vlans. For now we
      limit the list size to a single vlan in kernel.
      Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
      compatibility with older versions of IP Link tool.
      
      Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
      We kept 802.1Q as the drivers' default vlan protocol.
      Suitable ip link tool command examples:
        Set vf vlan protocol 802.1ad:
          ip link set eth0 vf 1 vlan 100 proto 802.1ad
        Set vf to VST (802.1Q) mode:
          ip link set eth0 vf 1 vlan 100 proto 802.1Q
        Or by omitting the new parameter
          ip link set eth0 vf 1 vlan 100
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79aab093
    • M
      net/mlx4_core: Preparation for VF vlan protocol 802.1ad · 7c3d21c8
      Moshe Shemesh 提交于
      Check device capability to support VF vlan protocol 802.1ad mode.
      Add vport attribute vlan protocol.
      Init vport vlan protocol by default to 802.1Q.
      Add update QP support for VF vlan protocol 802.1ad.
      Add func capability vlan_offload_disable to disable all
      vlan HW acceleration on VF while the VF is set to VF vlan protocol
      802.1ad mode.
      No change in VF vlan protocol 802.1Q (VST) mode.
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c3d21c8
    • M
      thread_info: Use unsigned long for flags · 907241dc
      Mark Rutland 提交于
      The generic THREAD_INFO_IN_TASK definition of thread_info::flags is a
      u32, matching x86 prior to the introduction of THREAD_INFO_IN_TASK.
      
      However, common helpers like test_ti_thread_flag() implicitly assume
      that thread_info::flags has at least the size and alignment of unsigned
      long, and relying on padding and alignment provided by other elements of
      task_struct is somewhat fragile. Additionally, some architectures use
      more that 32 bits for thread_info::flags, and others may need to in
      future.
      
      With THREAD_INFO_IN_TASK, task struct follows thread_info with a long
      field, and thus we no longer save any space as we did back in commit:
      
        affa219b ("x86: change thread_info's flag field back to 32 bits")
      
      Given all this, it makes more sense for the generic thread_info::flags
      to be an unsigned long.
      
      In fact given <linux/thread_info.h> contains/uses the helpers mentioned
      above, BE arches *must* use unsigned long (or something of the same size)
      today, or they wouldn't work.
      
      Make it so.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1474651447-30447-1-git-send-email-mark.rutland@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      907241dc