1. 25 Dec, 2018 1 commit
    • net: dccp: fix kernel crash on module load · c92c81df
      Peter Oskolkov committed
      Patch eedbbb0d "net: dccp: initialize (addr,port) ..."
      added a call to inet_hashinfo2_init() from dccp_init().
      
      However, inet_hashinfo2_init() is marked __init, so the kernel
      panics when dccp is loaded as a module. Removing the __init tag
      from inet_hashinfo2_init() is not feasible because it calls
      into __init functions in mm.
      
      This patch adds an inet_hashinfo2_init_mod() function that can
      be called after the init phase is done, changes dccp_init() to
      call the new function, and no longer exports
      inet_hashinfo2_init().
      
      Fixes: eedbbb0d ("net: dccp: initialize (addr,port) ...")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 23 Dec, 2018 4 commits
  3. 22 Dec, 2018 1 commit
  4. 21 Dec, 2018 13 commits
    • Revert "compiler-gcc: disable -ftracer for __noclone functions" · 2bcbd406
      Sean Christopherson committed
      The -ftracer optimization was disabled in __noclone as a workaround to
      GCC duplicating a blob of inline assembly that happened to define a
      global variable.  It has been pointed out that no amount of workarounds
      can guarantee the compiler won't duplicate inline assembly[1], and that
      disabling the -ftracer optimization has several unintended and nasty
      side effects[2][3].
      
      Now that the offending KVM code which required the workaround has
      been properly fixed and no longer uses __noclone, remove the -ftracer
      optimization tweak from __noclone.
      
      [1] https://lore.kernel.org/lkml/ri6y38lo23g.fsf@suse.cz/T/#u
      [2] https://lore.kernel.org/lkml/20181218140105.ajuiglkpvstt3qxs@treble/T/#u
      [3] https://patchwork.kernel.org/patch/8707981/#21817015
      
      This reverts commit 95272c29.
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Change offset in kvm_write_guest_offset_cached to unsigned · 7a86dab8
      Jim Mattson committed
      Since the offset is added directly to the hva from the
      gfn_to_hva_cache, a negative offset could result in an out of bounds
      write. The existing BUG_ON only checks for addresses beyond the end of
      the gfn_to_hva_cache, not for addresses before the start of the
      gfn_to_hva_cache.
      
      Note that all current call sites have non-negative offsets.
      
      Fixes: 4ec6e863 ("kvm: Introduce kvm_write_guest_offset_cached()")
      Reported-by: Cfir Cohen <cfir@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Cfir Cohen <cfir@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
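The bounds problem described in the commit above can be illustrated with a small user-space sketch. This is not the kernel code: `struct demo_cache` and both helper functions are hypothetical names invented for illustration, and the check only approximates an upper-bound test like the one the message attributes to the existing BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached region; not the kernel's struct. */
struct demo_cache {
    size_t len; /* number of bytes covered by the cache */
};

/* Signed offset: only the upper bound is checked, so a negative offset
 * passes even though the write would start before the region. */
bool demo_in_bounds_signed(const struct demo_cache *c, int offset, size_t len)
{
    return (long long)offset + (long long)len <= (long long)c->len;
}

/* Unsigned offset: a "negative" value wraps to a very large one and is
 * rejected by the same upper-bound check. */
bool demo_in_bounds_unsigned(const struct demo_cache *c, unsigned int offset,
                             size_t len)
{
    return (unsigned long long)offset + len <= c->len;
}
```

With a 64-byte region, an offset of -4 and a length of 8 pass the signed check, while the same bit pattern interpreted as unsigned fails it.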
    • net/mlx5e: XDP, Support Enhanced Multi-Packet TX WQE · 5e0d2eef
      Tariq Toukan committed
      Add support for the HW feature of multi-packet WQE in XDP
      xmit flow.
      
      The conventional TX descriptor (WQE, Work Queue Element) serves
      a single packet. Our HW has support for multi-packet WQE (MPWQE)
      in which a single descriptor serves multiple TX packets.
      
      This reduces both the PCI overhead and the CPU cycles spent on
      writing descriptors.
      
      In this patch we add support for the HW feature, which is supported
      starting from ConnectX-5.
      
      Performance:
      Tested packet rate for UDP 64Byte multi-stream over ConnectX-5 NICs.
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      XDP_TX:
      We see a huge gain on single port ConnectX-5, and reach the 100 Mpps
      milestone.
      * Single-port HCA:
      	Before:   70 Mpps
      	After:   100 Mpps (+42.8%)
      
      * Dual-port HCA:
      	Before: 51.7 Mpps
      	After:  57.3 Mpps (+10.8%)
      
      * In both cases traffic was tested on one port. For now, dual-port HCAs
        show only a small gain. We are working to overcome this bottleneck; at
        the moment, dual-port HCAs reach the numbers seen on single-port HCAs
        only with experimental firmware.
      
      XDP_REDIRECT:
      Redirect from (A) ConnectX-5 to (B) ConnectX-5.
      Due to a setup limitation, (A) and (B) are on different NUMA nodes,
      so absolute performance numbers are not optimal.
      Note:
        Below is the transmit rate of (B), not the redirect rate of (A)
        which is in some cases higher.
      
      * (B) is single-port:
      	Before:   77 Mpps
      	After:    90 Mpps (+16.8%)
      
      * (B) is dual-port:
      	Before:  61 Mpps
      	After:   72 Mpps (+18%)
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
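The percentage gains quoted in the commit above follow from the before and after packet rates. A minimal sketch of that arithmetic, with `gain_percent` being a name chosen here for illustration:

```c
#include <assert.h>

/* Relative gain in percent when a rate goes from `before` to `after`. */
double gain_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, `gain_percent(70.0, 100.0)` is roughly 42.86, which the commit message reports as +42.8%.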
    • vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver · 7f928917
      Alexey Kardashevskiy committed
      POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
      pluggable PCIe devices but still have PCIe links which are used
      for config space and MMIO. In addition, the GPUs have 6 NVLinks
      which are connected to other GPUs and the POWER9 CPU. POWER9 chips
      have a special on-die unit called an NPU, which is an NVLink2 host bus
      adapter with p2p connections to 2 to 3 GPUs, with 3 or 2 NVLinks to each.
      These systems also support ATS (address translation services), which is
      part of the NVLink2 protocol. Such GPUs also share on-board RAM
      (16GB or 32GB) with the system via the same NVLink2, so a CPU has
      cache-coherent access to GPU RAM.
      
      This exports GPU RAM to the userspace as a new VFIO device region. This
      preregisters the new memory as device memory as it might be used for DMA.
      This inserts pfns from the fault handler because the GPU memory is not
      onlined until the vendor driver is loaded and has trained the NVLinks;
      doing this earlier causes low-level errors, which we fence in the firmware
      so they do not hurt the host system but are still better avoided. For the
      same reason this does not map GPU RAM into the host kernel (the usual
      thing for emulated access otherwise).
      
      This exports an ATSD (Address Translation Shootdown) register of the NPU,
      which allows an operating system to perform TLB invalidations inside the
      GPU. The register conveniently occupies a single 64k page. It is also
      presented to the userspace as a new VFIO device region. One NPU has 8 ATSD
      registers, each of which can be used for TLB invalidation in a GPU linked
      to this NPU. This allocates one ATSD register per NVLink bridge, allowing
      up to 6 registers to be passed. Due to a host firmware bug (just recently
      fixed), only 1 ATSD register per NPU was actually advertised to the host
      system, so this passes that single register via the first NVLink bridge
      device in the group. This is still enough, as QEMU collects them all back
      and presents them to the guest via vPHB to mimic the emulated NPU PHB on
      the host.
      
      In order to provide the userspace with the information about GPU-to-NVLink
      connections, this exports an additional capability called "tgt"
      (which is an abbreviated host system bus address). The "tgt" property
      tells the GPU its own system address and allows the guest driver to
      conglomerate the routing information so each GPU knows how to get directly
      to the other GPUs.
      
      For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
      know LPID (a logical partition ID or a KVM guest hardware ID in other
      words) and PID (a memory context ID of a userspace process, not to be
      confused with a linux pid). This assigns a GPU to LPID in the NPU and
      this is why this adds a listener for KVM on an IOMMU group. A PID comes
      via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
      
      This requires coherent memory and ATSD to be available on the host as
      the GPU vendor only supports configurations with both features enabled
      and other configurations are known not to work. Because of this and
      because of the way the features are advertised to the host system
      (a device tree with very platform-specific properties),
      this requires the POWERNV platform to be enabled.
      
      The V100 GPUs do not advertise any of these capabilities via the config
      space and there are more than just one device ID so this relies on
      the platform to tell whether these GPUs have special abilities such as
      NVLinks.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • net: seg6.h: remove an unused #include · a6ae520d
      Peter Oskolkov committed
      A minor code cleanup.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • linux/netlink.h: drop unnecessary extern prefix · aa9d6e0f
      Stephen Hemminger committed
      The extern prefix is not needed before function prototypes.
      Checkpatch has complained about this for a couple of years.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: netns: shrink netns_ct struct · 8527f9df
      Florian Westphal committed
      Remove the obsolete sysctl anchors and move auto_assign_helper_warned
      to avoid/cover a hole.  Reduces size by 40 bytes on 64-bit.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
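The "cover a hole" part of the commit above refers to alignment padding inside a struct. The sketch below shows the general effect with two hypothetical structs; they are not the layout of netns_ct, and the exact sizes depend on the ABI. It does not model the removal of the sysctl anchors, which accounts for part of the 40 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout with two holes on typical ABIs. */
struct demo_holey {
    int   count;
    bool  flag_a;   /* padding follows so that ptr is aligned */
    void *ptr;
    bool  warned;   /* trailing padding follows */
};

/* Same members, with `warned` moved into the hole after flag_a. */
struct demo_packed {
    int   count;
    bool  flag_a;
    bool  warned;
    void *ptr;
};
```

On a typical LP64 target the first struct would be expected to occupy 24 bytes and the second 16, though the C standard does not guarantee specific values.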
    • netfilter: conntrack: remove empty pernet fini stubs · fc3893fd
      Florian Westphal committed
      After moving sysctl handling into a single place, the init functions
      can't fail anymore and some of the fini functions are empty.
      
      Remove them and change return type to void.
      This also simplifies error unwinding in conntrack module init path.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: un-export seq_print_acct · 4b216e21
      Florian Westphal committed
      Only one caller, just place it where it's needed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: udp: only extend timeout to stream mode after 2s · d535c8a6
      Florian Westphal committed
      Currently DNS resolvers that send both A and AAAA queries from the same
      source port can trigger stream mode prematurely, which results in a
      non-early-evictable conntrack entry for three minutes, even though DNS
      requests are done in a few milliseconds.
      
      Add a two second grace period where we continue to use the ordinary
      30-second default timeout.  It's enough for DNS request/response traffic,
      even if two request/reply packets are involved.
      
      ASSURED is still set, else conntrack (and thus a possible
      NAT mapping ...) gets zapped too in case the conntrack table runs full.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
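The timeout selection described in the commit above can be sketched roughly as follows. This is a simplified user-space model, not the conntrack implementation: `struct demo_flow`, `demo_udp_timeout` and the constant names are hypothetical, and only the 30-second, three-minute and two-second figures come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    DEMO_TIMEOUT_DEFAULT_S = 30,   /* ordinary default timeout */
    DEMO_TIMEOUT_STREAM_S  = 180,  /* stream mode: three minutes */
    DEMO_GRACE_S           = 2     /* grace period before stream mode */
};

/* Hypothetical per-flow state. */
struct demo_flow {
    unsigned int age_s;   /* seconds since the flow was created */
    bool seen_reply;      /* traffic was seen in both directions */
    bool assured;         /* entry is protected from early eviction */
};

/* Returns the timeout to apply when a packet refreshes the flow. */
unsigned int demo_udp_timeout(struct demo_flow *f)
{
    if (!f->seen_reply)
        return DEMO_TIMEOUT_DEFAULT_S;

    /* ASSURED is still set as soon as a reply has been seen. */
    f->assured = true;

    /* Only flows still active after the grace period get stream mode. */
    if (f->age_s > DEMO_GRACE_S)
        return DEMO_TIMEOUT_STREAM_S;
    return DEMO_TIMEOUT_DEFAULT_S;
}
```

In this model a DNS exchange that completes within the grace period keeps the 30-second timeout but is still marked assured.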
    • bpf: sk_msg, sock{map|hash} redirect through ULP · 0608c69c
      John Fastabend committed
      A sockmap program that redirects through a kTLS ULP enabled socket
      will not work correctly because the ULP layer is skipped. This
      fixes the behavior to call through the ULP layer on redirect to
      ensure any operations required on the data stream at the ULP layer
      continue to be applied.
      
      To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
      calling the BPF layer on a redirected message. This is
      required to avoid calling the BPF layer multiple times (possibly
      recursively) which is not the current/expected behavior without
      ULPs. In the future we may add a redirect flag if users _do_
      want the policy applied again but this would need to work for both
      ULP and non-ULP sockets and be opt-in to avoid breaking existing
      programs.
      
      Also, to avoid polluting the flag space with an internal flag, we
      reuse the flag space by overlapping MSG_SENDPAGE_NOPOLICY with
      MSG_WAITFORONE. Here WAITFORONE is specific to the recv path and
      SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
      to verify is that the user space API is masked correctly to ensure
      the flag cannot be set by the user. (Note this needs to be true
      regardless, because we already have internal flags in use that
      user space should not be able to set.) For completeness, we have
      two UAPI paths into sendpage: sendfile and splice.
      
      In the sendfile case the function do_sendfile() zeroes flags,
      
      ./fs/read_write.c:
       static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
      		   	    size_t count, loff_t max)
       {
         ...
         fl = 0;
      #if 0
         /*
          * We need to debate whether we can enable this or not. The
          * man page documents EAGAIN return for the output at least,
          * and the application is arguably buggy if it doesn't expect
          * EAGAIN on a non-blocking file descriptor.
          */
          if (in.file->f_flags & O_NONBLOCK)
      	fl = SPLICE_F_NONBLOCK;
      #endif
          file_start_write(out.file);
          retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
       }
      
      In the splice case the pipe_to_sendpage "actor" is used which
      masks flags with SPLICE_F_MORE.
      
      ./fs/splice.c:
       static int pipe_to_sendpage(struct pipe_inode_info *pipe,
      			    struct pipe_buffer *buf, struct splice_desc *sd)
       {
         ...
         more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
         ...
       }
      
      This confirms what we expect: internal flags are in fact internal
      to the socket side.
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
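The masking argument for the splice path in the commit above can be checked with a small model. The `DEMO_` constants below are local definitions whose values are assumed to mirror the kernel headers; they and `demo_splice_to_msg_flags` are illustrative names, not kernel identifiers.

```c
#include <assert.h>

/* Values assumed to match the kernel headers; treat them as assumptions. */
#define DEMO_SPLICE_F_MORE         0x04u
#define DEMO_MSG_MORE              0x8000u
#define DEMO_MSG_WAITFORONE        0x10000u
#define DEMO_MSG_SENDPAGE_NOPOLICY 0x10000u /* shares its bit with WAITFORONE */

/* Models the masking quoted from pipe_to_sendpage(): only SPLICE_F_MORE
 * is translated, so no other bit supplied by the caller survives. */
unsigned int demo_splice_to_msg_flags(unsigned int splice_flags)
{
    return (splice_flags & DEMO_SPLICE_F_MORE) ? DEMO_MSG_MORE : 0u;
}
```

Whatever bits a caller passes in, the result is either 0 or the MSG_MORE value, so the internal bit cannot be set through this path in the model.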
    • bpf: sk_msg, fix socket data_ready events · 552de910
      John Fastabend committed
      When a skb verdict program is in use and either another BPF program
      redirects to that socket or the new SK_PASS support is used, the
      data_ready callback does not wake up the application. Instead, because
      the stream parser/verdict is using the sk data_ready callback, we wake
      up the stream parser/verdict block.
      
      Fix this by adding a helper to check if the stream parser block is
      enabled on the sk and, if so, call the saved pointer, which is the
      upper layer's wake-up function.
      
      This fixes application stalls observed when an application is waiting
      for data in a blocking read().
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
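The saved-callback pattern behind the fix above can be sketched in user space as follows. All names here (`struct demo_sock`, `demo_parser_attach`, `demo_wake_app`) are hypothetical and only model the idea of calling the saved pointer when the parser has taken over the callback.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_sock;
typedef void (*demo_ready_fn)(struct demo_sock *sk);

/* Hypothetical socket model with a replaceable data_ready callback. */
struct demo_sock {
    demo_ready_fn data_ready;       /* current callback, may be the parser's */
    demo_ready_fn saved_data_ready; /* original callback saved by the parser */
    bool parser_enabled;
    int app_wakeups;
    int parser_wakeups;
};

void demo_app_ready(struct demo_sock *sk)    { sk->app_wakeups++; }
void demo_parser_ready(struct demo_sock *sk) { sk->parser_wakeups++; }

/* Attaching the parser replaces data_ready and saves the old pointer. */
void demo_parser_attach(struct demo_sock *sk)
{
    sk->saved_data_ready = sk->data_ready;
    sk->data_ready = demo_parser_ready;
    sk->parser_enabled = true;
}

/* Helper in the spirit of the fix: wake the application, not the parser. */
void demo_wake_app(struct demo_sock *sk)
{
    if (sk->parser_enabled)
        sk->saved_data_ready(sk);
    else
        sk->data_ready(sk);
}
```

Calling `sk->data_ready` directly after the parser is attached would increment only `parser_wakeups`, which corresponds to the stall the commit describes.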
    • bpf: skmsg, replace comments with BUILD bug · 7a69c0f2
      John Fastabend committed
      Enforce comment on structure layout dependency with a BUILD_BUG_ON
      to ensure the condition is maintained.
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
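Turning a layout comment into a compile-time check, as the commit above does with BUILD_BUG_ON, has a close user-space analogue in C11's `_Static_assert`. The sketch below uses a hypothetical `struct demo_msg`; the actual structure and condition checked by the patch are not shown in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure; not the one touched by the patch. */
struct demo_msg {
    unsigned int start;
    unsigned int end;
    unsigned char data[16];
};

/* Other code is assumed to rely on `data` being the last member with no
 * trailing padding. Instead of a comment, fail the build if that changes. */
_Static_assert(offsetof(struct demo_msg, data) +
               sizeof(((struct demo_msg *)0)->data) == sizeof(struct demo_msg),
               "data must be the last member of struct demo_msg");
```

If someone later appends a member after `data`, the build stops with the message above rather than silently breaking the dependent code.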
  5. 20 Dec, 2018 20 commits
  6. 19 Dec, 2018 1 commit