1. 29 8月, 2019 2 次提交
    • C
      dma-mapping: make dma_atomic_pool_init self-contained · 8e3a68fb
      Christoph Hellwig 提交于
      The memory allocated for the atomic pool needs to have the same
      mapping attributes that we use for remapping, so use
      pgprot_dmacoherent instead of open coding it.  Also deduct a
      suitable zone to allocate the memory from based on the presence
      of the DMA zones.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      8e3a68fb
    • C
      dma-mapping: remove arch_dma_mmap_pgprot · 419e2f18
      Christoph Hellwig 提交于
      arch_dma_mmap_pgprot is used for two things:
      
       1) to override the "normal" uncached page attributes for mapping
          memory coherent to devices that can't snoop the CPU caches
       2) to provide the special DMA_ATTR_WRITE_COMBINE semantics on older
          arm systems and some mips platforms
      
      Replace one with the pgprot_dmacoherent macro that is already provided
      by arm and much simpler to use, and lift the DMA_ATTR_WRITE_COMBINE
      handling to common code with an explicit arch opt-in.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	# m68k
      Acked-by: Paul Burton <paul.burton@mips.com>		# mips
      419e2f18
  2. 23 8月, 2019 1 次提交
  3. 22 8月, 2019 2 次提交
  4. 21 8月, 2019 2 次提交
  5. 20 8月, 2019 1 次提交
    • D
      keys: Fix description size · 555df336
      David Howells 提交于
      The maximum key description size is 4095.  Commit f771fde8 ("keys:
      Simplify key description management") inadvertantly reduced that to 255
      and made sizes between 256 and 4095 work weirdly, and any size whereby
      size & 255 == 0 would cause an assertion in __key_link_begin() at the
      following line:
      
      	BUG_ON(index_key->desc_len == 0);
      
      This can be fixed by simply increasing the size of desc_len in struct
      keyring_index_key to a u16.
      
      Note the argument length test in keyutils only checked empty
      descriptions and descriptions with a size around the limit (ie.  4095)
      and not for all the values in between, so it missed this.  This has been
      addressed and
      
      	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/commit/?id=066bf56807c26cd3045a25f355b34c1d8a20a5aa
      
      now exhaustively tests all possible lengths of type, description and
      payload and then some.
      
      The assertion failure looks something like:
      
       kernel BUG at security/keys/keyring.c:1245!
       ...
       RIP: 0010:__key_link_begin+0x88/0xa0
       ...
       Call Trace:
        key_create_or_update+0x211/0x4b0
        __x64_sys_add_key+0x101/0x200
        do_syscall_64+0x5b/0x1e0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      It can be triggered by:
      
      	keyctl add user "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" a @s
      
      Fixes: f771fde8 ("keys: Simplify key description management")
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      555df336
  6. 19 8月, 2019 3 次提交
    • E
      signal: Allow cifs and drbd to receive their terminating signals · 33da8e7c
      Eric W. Biederman 提交于
      My recent to change to only use force_sig for a synchronous events
      wound up breaking signal reception cifs and drbd.  I had overlooked
      the fact that by default kthreads start out with all signals set to
      SIG_IGN.  So a change I thought was safe turned out to have made it
      impossible for those kernel thread to catch their signals.
      
      Reverting the work on force_sig is a bad idea because what the code
      was doing was very much a misuse of force_sig.  As the way force_sig
      ultimately allowed the signal to happen was to change the signal
      handler to SIG_DFL.  Which after the first signal will allow userspace
      to send signals to these kernel threads.  At least for
      wake_ack_receiver in drbd that does not appear actively wrong.
      
      So correct this problem by adding allow_kernel_signal that will allow
      signals whose siginfo reports they were sent by the kernel through,
      but will not allow userspace generated signals, and update cifs and
      drbd to call allow_kernel_signal in an appropriate place so that their
      thread can receive this signal.
      
      Fixing things this way ensures that userspace won't be able to send
      signals and cause problems, that it is clear which signals the
      threads are expecting to receive, and it guarantees that nothing
      else in the system will be affected.
      
      This change was partly inspired by similar cifs and drbd patches that
      added allow_signal.
      Reported-by: Nronnie sahlberg <ronniesahlberg@gmail.com>
      Reported-by: NChristoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Tested-by: NChristoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Cc: Steve French <smfrench@gmail.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Fixes: 247bc947 ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
      Fixes: 72abe3bc ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
      Fixes: fee10990 ("signal/drbd: Use send_sig not force_sig")
      Fixes: 3cf5d076 ("signal: Remove task parameter from force_sig")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      33da8e7c
    • P
      netfilter: nf_tables: map basechain priority to hardware priority · 3bc158f8
      Pablo Neira Ayuso 提交于
      This patch adds initial support for offloading basechains using the
      priority range from 1 to 65535. This is restricting the netfilter
      priority range to 16-bit integer since this is what most drivers assume
      so far from tc. It should be possible to extend this range of supported
      priorities later on once drivers are updated to support for 32-bit
      integer priorities.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3bc158f8
    • P
      net: sched: use major priority number as hardware priority · ef01adae
      Pablo Neira Ayuso 提交于
      tc transparently maps the software priority number to hardware. Update
      it to pass the major priority which is what most drivers expect. Update
      drivers too so they do not need to lshift the priority field of the
      flow_cls_common_offload object. The stmmac driver is an exception, since
      this code assumes the tc software priority is fine, therefore, lshift it
      just to be conservative.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef01adae
  7. 17 8月, 2019 1 次提交
  8. 16 8月, 2019 1 次提交
    • J
      block: remove REQ_NOWAIT_INLINE · 7b6620d7
      Jens Axboe 提交于
      We had a few issues with this code, and there's still a problem around
      how we deal with error handling for chained/split bios. For now, just
      revert the code and we'll try again with a thoroug solution. This
      reverts commits:
      
      e15c2ffa ("block: fix O_DIRECT error handling for bio fragments")
      0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      893a1c97 ("blk-mq: allow REQ_NOWAIT to return an error inline")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7b6620d7
  9. 15 8月, 2019 1 次提交
  10. 14 8月, 2019 9 次提交
    • D
      rxrpc: Fix read-after-free in rxrpc_queue_local() · 06d9532f
      David Howells 提交于
      rxrpc_queue_local() attempts to queue the local endpoint it is given and
      then, if successful, prints a trace line.  The trace line includes the
      current usage count - but we're not allowed to look at the local endpoint
      at this point as we passed our ref on it to the workqueue.
      
      Fix this by reading the usage count before queuing the work item.
      
      Also fix the reading of local->debug_id for trace lines, which must be done
      with the same consideration as reading the usage count.
      
      Fixes: 09d2bf59 ("rxrpc: Add a tracepoint to track rxrpc_local refcounting")
      Reported-by: syzbot+78e71c5bab4f76a6a719@syzkaller.appspotmail.com
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      06d9532f
    • Y
      gpio: Fix build error of function redefinition · 68e03b85
      YueHaibing 提交于
      when do randbuilding, I got this error:
      
      In file included from drivers/hwmon/pmbus/ucd9000.c:19:0:
      ./include/linux/gpio/driver.h:576:1: error: redefinition of gpiochip_add_pin_range
       gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
       ^~~~~~~~~~~~~~~~~~~~~~
      In file included from drivers/hwmon/pmbus/ucd9000.c:18:0:
      ./include/linux/gpio.h:245:1: note: previous definition of gpiochip_add_pin_range was here
       gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
       ^~~~~~~~~~~~~~~~~~~~~~
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: 964cb341 ("gpio: move pincontrol calls to <linux/gpio/driver.h>")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Link: https://lore.kernel.org/r/20190731123814.46624-1-yuehaibing@huawei.comSigned-off-by: NLinus Walleij <linus.walleij@linaro.org>
      68e03b85
    • D
      netlink: Fix nlmsg_parse as a wrapper for strict message parsing · d00ee64e
      David Ahern 提交于
      Eric reported a syzbot warning:
      
      BUG: KMSAN: uninit-value in nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510
      CPU: 0 PID: 11812 Comm: syz-executor444 Not tainted 5.3.0-rc3+ #17
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x191/0x1f0 lib/dump_stack.c:113
       kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109
       __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294
       nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510
       rtm_del_nexthop+0x1b1/0x610 net/ipv4/nexthop.c:1543
       rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5223
       netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5241
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0xf6c/0x1050 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg net/socket.c:657 [inline]
       ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311
       __sys_sendmmsg+0x53a/0xae0 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2439
       __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2439
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:297
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      The root cause is nlmsg_parse calling __nla_parse which means the
      header struct size is not checked.
      
      nlmsg_parse should be a wrapper around __nlmsg_parse with
      NL_VALIDATE_STRICT for the validate argument very much like
      nlmsg_parse_deprecated is for NL_VALIDATE_LIBERAL.
      
      Fixes: 3de64403 ("netlink: re-add parse/validate functions in strict mode")
      Reported-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      d00ee64e
    • A
      Revert "mm, thp: restore node-local hugepage allocations" · a8282608
      Andrea Arcangeli 提交于
      This reverts commit 2f0799a0 ("mm, thp: restore node-local
      hugepage allocations").
      
      commit 2f0799a0 was rightfully applied to avoid the risk of a
      severe regression that was reported by the kernel test robot at the end
      of the merge window.  Now we understood the regression was a false
      positive and was caused by a significant increase in fairness during a
      swap trashing benchmark.  So it's safe to re-apply the fix and continue
      improving the code from there.  The benchmark that reported the
      regression is very useful, but it provides a meaningful result only when
      there is no significant alteration in fairness during the workload.  The
      removal of __GFP_THISNODE increased fairness.
      
      __GFP_THISNODE cannot be used in the generic page faults path for new
      memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
      behavior significantly deviates from what the MPOL_DEFAULT semantics are
      supposed to be for THP and 4k allocations alike.
      
      Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
      set to "madvise") has never meant to provide an implicit MPOL_BIND on
      the "current" node the task is running on, causing swap storms and
      providing a much more aggressive behavior than even zone_reclaim_node =
      3.
      
      Any workload who could have benefited from __GFP_THISNODE has now to
      enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
      the zone_reclaim_mode behavior, but it only did so if THP was enabled:
      if THP was disabled, there would have been no chance to get any 4k page
      from the current node if the current node was full of pagecache, which
      further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
      MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
      semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
      must work exactly the same with MADV_HUGEPAGE set or not.
      
      The performance characteristic of memory depends on the hardware
      details.  The numbers below are obtained on Naples/EPYC architecture and
      the N/A projection extends them to show what we should aim for in the
      future as a good THP NUMA locality default.  The benchmark used
      exercises random memory seeks (note: the cost of the page faults is not
      part of the measurement).
      
        D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
        0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
      
      D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
      intra socket memory), D2 means distance two (i.e.  inter socket memory),
      etc...
      
      For the guest physical memory allocated by qemu and for guest mode
      kernel the performance characteristic of RAM is more complex and an
      ideal default could be:
      
        D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
        0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
      
      NOTE: the N/A are projections and haven't been measured yet, the
      measurement in this case is done on a 1950x with only two NUMA nodes.
      The THP case here means THP was used both in the host and in the guest.
      
      After applying this commit the THP NUMA locality order that we'll get
      out of MADV_HUGEPAGE is this:
      
        D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Before this commit it was:
      
        D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Even if we ignore the breakage of large workloads that can't fit in a
      single node that the __GFP_THISNODE implicit "current node" mbind
      caused, the THP NUMA locality order provided by __GFP_THISNODE was still
      not the one we shall aim for in the long term (i.e.  the first one at
      the top).
      
      After this commit is applied, we can introduce a new allocator multi
      order API and to replace those two alloc_pages_vmas calls in the page
      fault path, with a single multi order call:
      
              unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
              page = alloc_pages_multi_order(..., &order);
              if (!page)
              	goto out;
              if (!(order & (1 << 0))) {
              	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
              	/* THP fault */
              } else {
              	VM_WARN_ON(order != 1 << 0);
              	/* 4k fallback */
              }
      
      The page allocator logic has to be altered so that when it fails on any
      zone with order 9, it has to try again with a order 0 before falling
      back to the next zone in the zonelist.
      
      After that we need to do more measurements and evaluate if adding an
      opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
      with "DN+1 THP | DN 4k" at every NUMA distance crossing.
      
      Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8282608
    • A
      Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 92717d42
      Andrea Arcangeli 提交于
      Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
      
      The fixes for what was originally reported as "pathological THP
      behavior" we rightfully reverted to be sure not to introduced
      regressions at end of a merge window after a severe regression report
      from the kernel bot.  We can safely re-apply them now that we had time
      to analyze the problem.
      
      The mm process worked fine, because the good fixes were eventually
      committed upstream without excessive delay.
      
      The regression reported by the kernel bot however forced us to revert
      the good fixes to be sure not to introduce regressions and to give us
      the time to analyze the issue further.  The silver lining is that this
      extra time allowed to think more at this issue and also plan for a
      future direction to improve things further in terms of THP NUMA
      locality.
      
      This patch (of 2):
      
      This reverts commit 356ff8a9 ("Revert "mm, thp: consolidate THP
      gfp handling into alloc_hugepage_direct_gfpmask").  So it reapplies
      89c83fb5 ("mm, thp: consolidate THP gfp handling into
      alloc_hugepage_direct_gfpmask").
      
      Consolidation of the THP allocation flags at the same place was meant to
      be a clean up to easier handle otherwise scattered code which is
      imposing a maintenance burden.  There were no real problems observed
      with the gfp mask consolidation but the reversion was rushed through
      without a larger consensus regardless.
      
      This patch brings the consolidation back because this should make the
      long term maintainability easier as well as it should allow future
      changes to be less error prone.
      
      [mhocko@kernel.org: changelog additions]
      Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      92717d42
    • Q
      include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used · 0cfaee2a
      Qian Cai 提交于
      A compiler throws a warning on an arm64 system since commit 9849a569
      ("arch, mm: convert all architectures to use 5level-fixup.h"),
      
        mm/kasan/init.c: In function 'kasan_free_p4d':
        mm/kasan/init.c:344:9: warning: variable 'p4d' set but not used [-Wunused-but-set-variable]
         p4d_t *p4d;
                ^~~
      
      because p4d_none() in "5level-fixup.h" is compiled away while it is a
      static inline function in "pgtable-nopud.h".
      
      However, if converted p4d_none() to a static inline there, powerpc would
      be unhappy as it reads those in assembler language in
      "arch/powerpc/include/asm/book3s/64/pgtable.h", so it needs to skip
      assembly include for the static inline C function.
      
      While at it, converted a few similar functions to be consistent with the
      ones in "pgtable-nopud.h".
      
      Link: http://lkml.kernel.org/r/20190806232917.881-1-cai@lca.pwSigned-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cfaee2a
    • R
      mm: workingset: fix vmstat counters for shadow nodes · ec9f0238
      Roman Gushchin 提交于
      Memcg counters for shadow nodes are broken because the memcg pointer is
      obtained in a wrong way. The following approach is used:
              virt_to_page(xa_node)->mem_cgroup
      
      Since commit 4d96ba35 ("mm: memcg/slab: stop setting
      page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
      set for slab pages, so memcg_from_slab_page() should be used instead.
      
      Also I doubt that it ever worked correctly: virt_to_head_page() should
      be used instead of virt_to_page().  Otherwise objects residing on tail
      pages are not accounted, because only the head page contains a valid
      mem_cgroup pointer.  That was a case since the introduction of these
      counters by the commit 68d48e6a ("mm: workingset: add vmstat counter
      for shadow nodes").
      
      Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec9f0238
    • R
      mm: document zone device struct page field usage · 76470ccd
      Ralph Campbell 提交于
      Patch series "mm/hmm: fixes for device private page migration", v3.
      
      Testing the latest linux git tree turned up a few bugs with page
      migration to and from ZONE_DEVICE private and anonymous pages.
      Hopefully it clarifies how ZONE_DEVICE private struct page uses the same
      mapping and index fields from the source anonymous page mapping.
      
      This patch (of 3):
      
      Struct page for ZONE_DEVICE private pages uses the page->mapping and and
      page->index fields while the source anonymous pages are migrated to
      device private memory.  This is so rmap_walk() can find the page when
      migrating the ZONE_DEVICE private page back to system memory.
      ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
      page->index fields when files are mapped into a process address space.
      
      Add comments to struct page and remove the unused "_zd_pad_1" field to
      make this more clear.
      
      Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.comSigned-off-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76470ccd
    • B
      RDMA/siw: Change CQ flags from 64->32 bits · 2c8ccb37
      Bernard Metzler 提交于
      This patch changes the driver/user shared (mmapped) CQ notification
      flags field from unsigned 64-bits size to unsigned 32-bits size. This
      enables building siw on 32-bit architectures.
      
      This patch changes the siw-abi, but as siw was only just merged in
      this merge window cycle, there are no released kernels with the prior
      abi.  We are making no attempt to be binary compatible with siw user
      space libraries prior to the merge of siw into the upstream kernel,
      only moving forward with upstream kernels and upstream rdma-core
      provided siw libraries are we guaranteeing compatibility.
      Signed-off-by: NBernard Metzler <bmt@zurich.ibm.com>
      Link: https://lore.kernel.org/r/20190809151816.13018-1-bmt@zurich.ibm.comSigned-off-by: NDoug Ledford <dledford@redhat.com>
      2c8ccb37
  11. 12 8月, 2019 1 次提交
  12. 11 8月, 2019 1 次提交
    • C
      dma-mapping: fix page attributes for dma_mmap_* · 33dcb37c
      Christoph Hellwig 提交于
      All the way back to introducing dma_common_mmap we've defaulted to mark
      the pages as uncached.  But this is wrong for DMA coherent devices.
      Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that
      flag is only treated special on the alloc side for non-coherent devices.
      
      Introduce a new dma_pgprot helper that deals with the check for coherent
      devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
      and we thus ensure no aliasing of page attributes happens, which makes
      the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
      remaining ones.
      
      Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
      we'll phase it out soon.
      
      Fixes: 64ccc9c0 ("common: dma-mapping: add support for generic dma_mmap_* calls")
      Reported-by: NShawn Anastasio <shawn@anastas.io>
      Reported-by: NGavin Li <git@thegavinli.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
      33dcb37c
  13. 10 8月, 2019 1 次提交
    • D
      sock: make cookie generation global instead of per netns · cd48bdda
      Daniel Borkmann 提交于
      Generating and retrieving socket cookies are a useful feature that is
      exposed to BPF for various program types through bpf_get_socket_cookie()
      helper.
      
      The fact that the cookie counter is per netns is quite a limitation
      for BPF in practice in particular for programs in host namespace that
      use socket cookies as part of a map lookup key since they will be
      causing socket cookie collisions e.g. when attached to BPF cgroup hooks
      or cls_bpf on tc egress in host namespace handling container traffic
      from veth or ipvlan devices with peer in different netns. Change the
      counter to be global instead.
      
      Socket cookie consumers must assume the value as opqaue in any case.
      Not every socket must have a cookie generated and knowledge of the
      counter value itself does not provide much value either way hence
      conversion to global is fine.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd48bdda
  14. 09 8月, 2019 6 次提交
    • P
      netfilter: nf_tables: use-after-free in failing rule with bound set · 6a0a8d10
      Pablo Neira Ayuso 提交于
      If a rule that has already a bound anonymous set fails to be added, the
      preparation phase releases the rule and the bound set. However, the
      transaction object from the abort path still has a reference to the set
      object that is stale, leading to a use-after-free when checking for the
      set->bound field. Add a new field to the transaction that specifies if
      the set is bound, so the abort path can skip releasing it since the rule
      command owns it and it takes care of releasing it. After this update,
      the set->bound field is removed.
      
      [   24.649883] Unable to handle kernel paging request at virtual address 0000000000040434
      [   24.657858] Mem abort info:
      [   24.660686]   ESR = 0x96000004
      [   24.663769]   Exception class = DABT (current EL), IL = 32 bits
      [   24.669725]   SET = 0, FnV = 0
      [   24.672804]   EA = 0, S1PTW = 0
      [   24.675975] Data abort info:
      [   24.678880]   ISV = 0, ISS = 0x00000004
      [   24.682743]   CM = 0, WnR = 0
      [   24.685723] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000428952000
      [   24.692207] [0000000000040434] pgd=0000000000000000
      [   24.697119] Internal error: Oops: 96000004 [#1] SMP
      [...]
      [   24.889414] Call trace:
      [   24.891870]  __nf_tables_abort+0x3f0/0x7a0
      [   24.895984]  nf_tables_abort+0x20/0x40
      [   24.899750]  nfnetlink_rcv_batch+0x17c/0x588
      [   24.904037]  nfnetlink_rcv+0x13c/0x190
      [   24.907803]  netlink_unicast+0x18c/0x208
      [   24.911742]  netlink_sendmsg+0x1b0/0x350
      [   24.915682]  sock_sendmsg+0x4c/0x68
      [   24.919185]  ___sys_sendmsg+0x288/0x2c8
      [   24.923037]  __sys_sendmsg+0x7c/0xd0
      [   24.926628]  __arm64_sys_sendmsg+0x2c/0x38
      [   24.930744]  el0_svc_common.constprop.0+0x94/0x158
      [   24.935556]  el0_svc_handler+0x34/0x90
      [   24.939322]  el0_svc+0x8/0xc
      [   24.942216] Code: 37280300 f9404023 91014262 aa1703e0 (f9401863)
      [   24.948336] ---[ end trace cebbb9dcbed3b56f ]---
      
      Fixes: f6ac8585 ("netfilter: nf_tables: unbind set in rule from commit path")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      6a0a8d10
    • J
      net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Jakub Kicinski 提交于
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket such as skb_orphan() or skb_clone() will cause clear text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted, if TLS TX is in offload mode.
      Then in sk_validate_xmit_skb() catch skbs which have no socket
      (or a socket with no validation) and decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are slightly interchangeable right now,
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that's the option guarding the
      sk_buff->decrypted member.
      
      Second, smaller issue with orphaning is that it breaks
      the guarantee that packets will be delivered to device
      queues in-order. All TLS offload drivers depend on that
      scheduling property. This means skb_orphan_partial()'s
      trick of preserving partial socket references will cause
      issues in the drivers. We need a full orphan, and as a
      result netem delay/throttling will cause all TLS offload
      skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from
      leaking clear text when incoming, decrypted skb is redirected
      (e.g. by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location which could leak the flag in is tcp_bpf_sendmsg(),
      which is taken care of by clearing the previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NBoris Pismenny <borisp@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41477662
    • G
      inet: frags: re-introduce skb coalescing for local delivery · 891584f4
      Guillaume Nault 提交于
      Before commit d4289fcc ("net: IP6 defrag: use rbtrees for IPv6
      defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
      generating many fragments) and running over an IPsec tunnel, reported
      more than 6Gbps throughput. After that patch, the same test gets only
      9Mbps when receiving on a be2net nic (driver can make a big difference
      here, for example, ixgbe doesn't seem to be affected).
      
      By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
      (IPv4 fragment coalescing was dropped by commit 14fe22e3 ("Revert
      "ipv4: use skb coalescing in defragmentation"")).
      
      Without fragment coalescing, be2net runs out of Rx ring entries and
      starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
      the netperf traffic is only composed of UDP fragments, any lost packet
      prevents reassembly of the full datagram. Therefore, fragments which
      have no possibility to ever get reassembled pile up in the reassembly
      queue, until the memory accounting exeeds the threshold. At that point
      no fragment is accepted anymore, which effectively discards all
      netperf traffic.
      
      When reassembly timeout expires, some stale fragments are removed from
      the reassembly queue, so a few packets can be received, reassembled
      and delivered to the netperf receiver. But the nic still drops frames
      and soon the reassembly queue gets filled again with stale fragments.
      These long time frames where no datagram can be received explain why
      the performance drop is so significant.
      
      Re-introducing fragment coalescing is enough to get the initial
      performances again (6.6Gbps with be2net): driver doesn't drop frames
      anymore (no more rx_drops_no_frags errors) and the reassembly engine
      works at full speed.
      
      This patch is quite conservative and only coalesces skbs for local
      IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
      forwarding). Coalescing could be extended in the future if need be, as
      more scenarios would probably benefit from it.
      
      [0]: Test configuration
      Sender:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      netserver -D -L fc00:2::1
      
      Receiver:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      891584f4
    • T
      net/mlx5e: kTLS, Fix progress params context WQE layout · a9bc3390
      Tariq Toukan 提交于
      The TLS progress params context WQE should not include an
      Eth segment, drop it.
      In addition, align the tls_progress_params layout with the
      HW specification document:
      - fix the tisn field name.
      - remove the valid bit.
      
      Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
      Fixes: d2ead1f3 ("net/mlx5e: Add kTLS TX HW offload support")
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      a9bc3390
    • T
      net/mlx5: kTLS, Fix wrong TIS opmod constants · 26149e3e
      Tariq Toukan 提交于
      Fix the used constants for TLS TIS opmods, per the HW specification.
      
      Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      26149e3e
    • M
      auxdisplay: charlcd: move charlcd.h to drivers/auxdisplay · 75354284
      Masahiro Yamada 提交于
      This header is included in drivers/auxdisplay/. Make it a local header.
      Reviewed-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      75354284
  15. 07 8月, 2019 3 次提交
    • A
      Revert "drm/amdkfd: New IOCTL to allocate queue GWS" · 4b3e30ed
      Alex Deucher 提交于
      This reverts commit 1a058c33.
      
      This interface is still in too much flux.  Revert until
      it's sorted out.
      Acked-by: NOak Zeng <Oak.Zeng@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      4b3e30ed
    • V
      net: sched: sample: allow accessing psample_group with rtnl · 67cbf7de
      Vlad Buslov 提交于
      Recently implemented support for sample action in flow_offload infra leads
      to following rcu usage warning:
      
      [ 1938.234856] =============================
      [ 1938.234858] WARNING: suspicious RCU usage
      [ 1938.234863] 5.3.0-rc1+ #574 Not tainted
      [ 1938.234866] -----------------------------
      [ 1938.234869] include/net/tc_act/tc_sample.h:47 suspicious rcu_dereference_check() usage!
      [ 1938.234872]
                     other info that might help us debug this:
      
      [ 1938.234875]
                     rcu_scheduler_active = 2, debug_locks = 1
      [ 1938.234879] 1 lock held by tc/19540:
      [ 1938.234881]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
      [ 1938.234900]
                     stack backtrace:
      [ 1938.234905] CPU: 2 PID: 19540 Comm: tc Not tainted 5.3.0-rc1+ #574
      [ 1938.234908] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [ 1938.234911] Call Trace:
      [ 1938.234922]  dump_stack+0x85/0xc0
      [ 1938.234930]  tc_setup_flow_action+0xed5/0x2040
      [ 1938.234944]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
      [ 1938.234965]  fl_change+0xd24/0x1b30 [cls_flower]
      [ 1938.234990]  tc_new_tfilter+0x3e0/0x970
      [ 1938.235021]  ? tc_del_tfilter+0x720/0x720
      [ 1938.235028]  rtnetlink_rcv_msg+0x389/0x4b0
      [ 1938.235038]  ? netlink_deliver_tap+0x95/0x400
      [ 1938.235044]  ? rtnl_dellink+0x2d0/0x2d0
      [ 1938.235053]  netlink_rcv_skb+0x49/0x110
      [ 1938.235063]  netlink_unicast+0x171/0x200
      [ 1938.235073]  netlink_sendmsg+0x224/0x3f0
      [ 1938.235091]  sock_sendmsg+0x5e/0x60
      [ 1938.235097]  ___sys_sendmsg+0x2ae/0x330
      [ 1938.235111]  ? __handle_mm_fault+0x12cd/0x19e0
      [ 1938.235125]  ? __handle_mm_fault+0x12cd/0x19e0
      [ 1938.235138]  ? find_held_lock+0x2b/0x80
      [ 1938.235147]  ? do_user_addr_fault+0x22d/0x490
      [ 1938.235160]  __sys_sendmsg+0x59/0xa0
      [ 1938.235178]  do_syscall_64+0x5c/0xb0
      [ 1938.235187]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1938.235192] RIP: 0033:0x7ff9a4d597b8
      [ 1938.235197] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
       ec 28 89 54
      [ 1938.235200] RSP: 002b:00007ffcfe381c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [ 1938.235205] RAX: ffffffffffffffda RBX: 000000005d4497f9 RCX: 00007ff9a4d597b8
      [ 1938.235208] RDX: 0000000000000000 RSI: 00007ffcfe381cb0 RDI: 0000000000000003
      [ 1938.235211] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
      [ 1938.235214] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
      [ 1938.235217] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001
      
      Change tcf_sample_psample_group() helper to allow using it from both rtnl
      and rcu protected contexts.
      
      Fixes: a7a7be60 ("net/sched: add sample action to the hardware intermediate representation")
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Reviewed-by: NPieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67cbf7de
    • V
      net: sched: police: allow accessing police->params with rtnl · c4bd4869
      Vlad Buslov 提交于
      Recently implemented support for police action in flow_offload infra leads
      to following rcu usage warning:
      
      [ 1925.881092] =============================
      [ 1925.881094] WARNING: suspicious RCU usage
      [ 1925.881098] 5.3.0-rc1+ #574 Not tainted
      [ 1925.881100] -----------------------------
      [ 1925.881104] include/net/tc_act/tc_police.h:57 suspicious rcu_dereference_check() usage!
      [ 1925.881106]
                     other info that might help us debug this:
      
      [ 1925.881109]
                     rcu_scheduler_active = 2, debug_locks = 1
      [ 1925.881112] 1 lock held by tc/18591:
      [ 1925.881115]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
      [ 1925.881124]
                     stack backtrace:
      [ 1925.881127] CPU: 2 PID: 18591 Comm: tc Not tainted 5.3.0-rc1+ #574
      [ 1925.881130] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
      [ 1925.881132] Call Trace:
      [ 1925.881138]  dump_stack+0x85/0xc0
      [ 1925.881145]  tc_setup_flow_action+0x1771/0x2040
      [ 1925.881155]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
      [ 1925.881175]  fl_change+0xd24/0x1b30 [cls_flower]
      [ 1925.881200]  tc_new_tfilter+0x3e0/0x970
      [ 1925.881231]  ? tc_del_tfilter+0x720/0x720
      [ 1925.881243]  rtnetlink_rcv_msg+0x389/0x4b0
      [ 1925.881250]  ? netlink_deliver_tap+0x95/0x400
      [ 1925.881257]  ? rtnl_dellink+0x2d0/0x2d0
      [ 1925.881264]  netlink_rcv_skb+0x49/0x110
      [ 1925.881275]  netlink_unicast+0x171/0x200
      [ 1925.881284]  netlink_sendmsg+0x224/0x3f0
      [ 1925.881299]  sock_sendmsg+0x5e/0x60
      [ 1925.881305]  ___sys_sendmsg+0x2ae/0x330
      [ 1925.881309]  ? task_work_add+0x43/0x50
      [ 1925.881314]  ? fput_many+0x45/0x80
      [ 1925.881329]  ? __lock_acquire+0x248/0x1930
      [ 1925.881342]  ? find_held_lock+0x2b/0x80
      [ 1925.881347]  ? task_work_run+0x7b/0xd0
      [ 1925.881359]  __sys_sendmsg+0x59/0xa0
      [ 1925.881375]  do_syscall_64+0x5c/0xb0
      [ 1925.881381]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1925.881384] RIP: 0033:0x7feb245047b8
      [ 1925.881388] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
       ec 28 89 54
      [ 1925.881391] RSP: 002b:00007ffc2d2a5788 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [ 1925.881395] RAX: ffffffffffffffda RBX: 000000005d4497ed RCX: 00007feb245047b8
      [ 1925.881398] RDX: 0000000000000000 RSI: 00007ffc2d2a57f0 RDI: 0000000000000003
      [ 1925.881400] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
      [ 1925.881403] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
      [ 1925.881406] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001
      
      Change tcf_police_rate_bytes_ps() and tcf_police_tcfp_burst() helpers to
      allow using them from both rtnl and rcu protected contexts.
      
      Fixes: 8c8cfc6e ("net/sched: add police action to the hardware intermediate representation")
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Reviewed-by: NPieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4bd4869
  16. 06 8月, 2019 1 次提交
  17. 05 8月, 2019 4 次提交
    • M
      KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block · 5eeaf10e
      Marc Zyngier 提交于
      Since commit commit 328e5664 ("KVM: arm/arm64: vgic: Defer
      touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or
      its GICv2 equivalent) loaded as long as we can, only syncing it
      back when we're scheduled out.
      
      There is a small snag with that though: kvm_vgic_vcpu_pending_irq(),
      which is indirectly called from kvm_vcpu_check_block(), needs to
      evaluate the guest's view of ICC_PMR_EL1. At the point were we
      call kvm_vcpu_check_block(), the vcpu is still loaded, and whatever
      changes to PMR is not visible in memory until we do a vcpu_put().
      
      Things go really south if the guest does the following:
      
      	mov x0, #0	// or any small value masking interrupts
      	msr ICC_PMR_EL1, x0
      
      	[vcpu preempted, then rescheduled, VMCR sampled]
      
      	mov x0, #ff	// allow all interrupts
      	msr ICC_PMR_EL1, x0
      	wfi		// traps to EL2, so samping of VMCR
      
      	[interrupt arrives just after WFI]
      
      Here, the hypervisor's view of PMR is zero, while the guest has enabled
      its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no
      interrupts are pending (despite an interrupt being received) and we'll
      block for no reason. If the guest doesn't have a periodic interrupt
      firing once it has blocked, it will stay there forever.
      
      To avoid this unfortuante situation, let's resync VMCR from
      kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block()
      will observe the latest value of PMR.
      
      This has been found by booting an arm64 Linux guest with the pseudo NMI
      feature, and thus using interrupt priorities to mask interrupts instead
      of the usual PSTATE masking.
      
      Cc: stable@vger.kernel.org # 4.12
      Fixes: 328e5664 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put")
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      5eeaf10e
    • G
      KVM: no need to check return value of debugfs_create functions · 3e7093d0
      Greg KH 提交于
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return
      void instead of an integer, as we should not care at all about if this
      function actually does anything or not.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <x86@kernel.org>
      Cc: <kvm@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3e7093d0
    • P
      KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Paolo Bonzini 提交于
      There is no need for this function as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      let us actually simplify the code.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      741cbbae
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li 提交于
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
      on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() sustains to be true before finish_swait() is called in
      kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
      by kvm_vcpu_on_spin() loop greatly increases the probability condition
      kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
      is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
      vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      17e433b5