1. 26 March 2018, 2 commits
    • net: Drop NETDEV_UNREGISTER_FINAL · 070f2d7e
      Committed by Kirill Tkhai
      The last user is gone after bdf5bd7f "rds: tcp: remove
      register_netdevice_notifier infrastructure.", so we can
      remove this netdevice command. This allows us to delete
      rtnl_lock() in netdev_run_todo(), which is a hot path for
      net namespace unregistration.
      
      dev_change_net_namespace() and netdev_wait_allrefs()
      have an rcu_barrier() before the NETDEV_UNREGISTER_FINAL call,
      and the source commits say they were introduced to
      separate the call from NETDEV_UNREGISTER, but this patch
      leaves them in place, since they require additional
      analysis of whether they are still needed for something else.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      070f2d7e
    • net: Make NETDEV_XXX commands enum { } · ede2762d
      Committed by Kirill Tkhai
      This patch is preparation to drop NETDEV_UNREGISTER_FINAL.
      Since the cmd is used in usnic_ib_netdev_event_to_string()
      to get the cmd name, simply removing NETDEV_UNREGISTER_FINAL
      everywhere would leave holes in event2str[] in this
      function.
      
      Instead, let's make the NETDEV_XXX command names
      available to everyone, and define netdev_cmd_to_name()
      in a way that does not require reshuffling names whenever
      their numbers change (see the sketch below).
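      
      For illustration only (not part of this patch), one way to keep the
      enum and the name function in sync is to drive both from a single
      macro list, so renumbering never requires reshuffling a name table.
      The list below is a sketch of the technique, not the exact code:
      
        /* Sketch: one list drives both the enum and netdev_cmd_to_name(). */
        #define NETDEV_EVENT_LIST \
                N(UP)             \
                N(DOWN)           \
                N(CHANGE)         \
                N(REGISTER)       \
                N(UNREGISTER)
      
        enum netdev_cmd {
        #define N(val) NETDEV_##val,
                NETDEV_EVENT_LIST
        #undef N
        };
      
        static const char *netdev_cmd_to_name(enum netdev_cmd cmd)
        {
                switch (cmd) {
        #define N(val) case NETDEV_##val: return "NETDEV_" #val;
                NETDEV_EVENT_LIST
        #undef N
                }
                return "UNKNOWN_NETDEV_EVENT";
        }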
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ede2762d
  2. 23 March 2018, 1 commit
    • Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      Committed by Daniel Vacek
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit was meant to be a boot-time
      init speedup, skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mappings on x86_64 (or more generally,
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID), the
      implementation also skips valid pfns, which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: Daniel Vacek <neelx@redhat.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f59f1caf
  3. 22 March 2018, 1 commit
  4. 20 March 2018, 3 commits
    • bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      Committed by John Fastabend
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages when a sock is added to a
      sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
      program type attached then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in sendmsg case and page/offset in sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and in the sendpage case leaves the data untouched. Both cases
      return -EACCES to the user. Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. This is used similar to existing redirect use
      cases. This allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
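      
      As an extra illustration (a hypothetical program, not taken from this
      patch), a minimal BPF_PROG_TYPE_SK_MSG verdict program could look
      roughly like this; the helper header is the one used by the kernel's
      BPF selftests:
      
        // Sketch of a do-nothing SK_MSG verdict program.
        #include <linux/bpf.h>
        #include "bpf_helpers.h"
      
        SEC("sk_msg")
        int msg_verdict(struct sk_msg_md *msg)
        {
                /* A real policy would inspect msg->data / msg->data_end,
                 * or call msg_redirect_map(), before deciding.
                 */
                return SK_PASS;
        }
      
        char _license[] SEC("license") = "GPL";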
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f738adb
    • net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG · 312fc2b4
      Committed by John Fastabend
      When calling do_tcp_sendpages() from within the kernel, where we know
      the data has no references from the user side, we can omit the
      SKBTX_SHARED_FRAG flag. This patch adds an internal flag,
      NO_SKBTX_SHARED_FRAG, that can be used to omit setting SKBTX_SHARED_FRAG.
      
      The flag is not exposed to userspace because the sendpage call from
      the splice logic masks out all bits except MSG_MORE.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      312fc2b4
    • percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods · b3a5d111
      Committed by Tejun Heo
      percpu_ref internally uses sched-RCU to implement the percpu -> atomic
      mode switching and the documentation suggested that this could be
      depended upon.  This doesn't seem like a good idea.
      
      * percpu_ref uses sched-RCU, which has different grace periods than
        regular RCU.  Users may combine percpu_ref with regular RCU usage and
        incorrectly believe that regular RCU grace periods are performed by
        percpu_ref.  This can lead to, for example, use-after-free due to
        premature freeing.
      
      * percpu_ref has a grace period when switching from percpu to atomic
        mode.  It doesn't have one between the last put and release.  This
        distinction is subtle and can lead to surprising bugs.
      
      * percpu_ref allows starting in and switching to atomic mode manually
        for debugging and other purposes.  This means that there may not be
        any grace periods from kill to release.
      
      This patch makes it clear that the grace periods are percpu_ref's
      internal implementation detail and can't be depended upon by the
      users.
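      
      As an illustration (a hedged sketch, not part of the patch; the struct
      and the obj_free_rcu() callback below are hypothetical), the safe
      pattern is to provide any needed grace period explicitly in the
      release callback rather than relying on percpu_ref's internal
      sched-RCU:
      
        /* Sketch: do NOT assume an RCU grace period separates the last
         * percpu_ref_put() from the release callback firing.
         */
        static void obj_release(struct percpu_ref *ref)
        {
                struct obj *o = container_of(ref, struct obj, ref);
      
                call_rcu(&o->rcu_head, obj_free_rcu); /* explicit grace period */
        }
      
        /* setup */
        percpu_ref_init(&o->ref, obj_release, 0, GFP_KERNEL);
        /* teardown: release may run as soon as the count hits zero */
        percpu_ref_kill(&o->ref);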
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b3a5d111
  5. 17 March 2018, 1 commit
    • net: Add rtnl_lock_killable() · 79ffdfc6
      Committed by Kirill Tkhai
      rtnl_lock() is a widely used mutex in the kernel. Some kernel code
      performs memory allocations under it. In case of memory pressure this
      may invoke the OOM killer, but the problem is that a killed task can't
      exit while it is waiting for the mutex. This can lead to deadlock
      and panic.
      
      This patch adds a new primitive that responds to SIGKILL, so it can
      be used in places where we don't want to sleep forever. A usage
      sketch follows below.
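      
      A hedged usage sketch (the calling function is hypothetical; the
      return convention follows mutex_lock_killable()):
      
        /* Sketch: bail out with a signal error instead of sleeping
         * forever if the task was killed while waiting for RTNL.
         */
        static int some_rtnl_op(void)
        {
                int err;
      
                err = rtnl_lock_killable();     /* 0 or -EINTR */
                if (err)
                        return err;
      
                /* ... work under RTNL ... */
      
                rtnl_unlock();
                return 0;
        }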
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79ffdfc6
  6. 16 March 2018, 2 commits
    • vlan: Fix out of order vlan headers with reorder header off · cbe7128c
      Committed by Toshiaki Makita
      With reorder header off, received packets are untagged in skb_vlan_untag()
      called from within __netif_receive_skb_core(), and later the tag will be
      inserted back in vlan_do_receive().
      
      This caused out of order vlan headers when we create a vlan device on top
      of another vlan device, because vlan_do_receive() inserts a tag as the
      outermost vlan tag. E.g. the outer tag is first removed in skb_vlan_untag()
      and inserted back in vlan_do_receive(), then the inner tag is next removed
      and inserted back as the outermost tag.
      
      This patch fixes the behaviour by inserting the inner tag at the right
      position.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbe7128c
    • fs: Teach path_connected to handle nfs filesystems with multiple roots. · 95dd7758
      Committed by Eric W. Biederman
      On nfsv2 and nfsv3 the nfs server can export subsets of the same
      filesystem and report the same filesystem identifier, so that the nfs
      client can know they are the same filesystem.  The subsets can be from
      disjoint directory trees.  The nfsv2 and nfsv3 protocols provide no
      way to find the common root of all directory trees exported from the
      server with the same filesystem identifier.
      
      The practical result is that for nfs, s_root in struct super_block is
      not necessarily the root of the filesystem.  The nfs mount code sets
      s_root to the root of the first subset of the nfs filesystem that the
      kernel mounts.
      
      This affects the dcache invalidation code in generic_shutdown_super(),
      currently called shrink_dcache_for_umount(), and that code has for
      years gone through an additional list of dentries that might be
      dentry trees needing to be freed, to accommodate nfs.
      
      When I wrote path_connected() I did not realize nfs was so special,
      and its heuristic for avoiding calling is_subdir() can fail.
      
      The practical case where this fails is when a directory is moved from
      the subtree exposed by one nfs mount to the subtree exposed by another
      nfs mount.  This move can happen either locally or remotely.  The
      remote case requires that the moved directory be cached before the
      move, and that after the move someone walks the path to where the
      moved directory now exists, which causes the already-cached directory
      to be moved in the dcache through the magic of d_splice_alias().
      
      If someone whose working directory is in the moved directory or a
      subdirectory then starts calling .. from the initial mount of nfs
      (where s_root == mnt_root), path_connected() as a heuristic will not
      bother with the is_subdir() check.  As s_root really is not the root
      of the nfs filesystem, this heuristic is wrong, and the path may
      actually not be connected, so path_connected() can fail.
      
      The is_subdir function might be cheap enough that we can call it
      unconditionally.  Verifying that will take some benchmarking and
      the result may not be the same on all kernels this fix needs
      to be backported to.  So I am avoiding that for now.
      
      Filesystems with snapshots such as nilfs and btrfs do something
      similar.  But as the directory trees of the snapshots are disjoint
      from one another and from the main directory tree, rename won't move
      things between them and this problem will not occur.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      95dd7758
  7. 15 March 2018, 2 commits
    • KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintid · 16ca6a60
      Committed by Marc Zyngier
      The vgic code is trying to be clever when injecting GICv2 SGIs,
      and will happily populate LRs with the same interrupt number if
      they come from multiple vcpus (after all, they are distinct
      interrupt sources).
      
      Unfortunately, this is against the letter of the architecture,
      and the GICv2 architecture spec says "Each valid interrupt stored
      in the List registers must have a unique VirtualID for that
      virtual CPU interface.". GICv3 has similar (although slightly
      ambiguous) restrictions.
      
      This results in guests locking up when using GICv2-on-GICv3, for
      example. The obvious fix is to stop trying so hard, and inject
      a single vcpu per SGI per guest entry. After all, pending SGIs
      with multiple source vcpus are pretty rare, and are mostly seen
      in scenarios where the physical CPUs are severely overcommitted.
      
      But as we now only inject a single instance of a multi-source SGI per
      vcpu entry, we may delay those interrupts for longer than strictly
      necessary, and run the risk of injecting lower priority interrupts
      in the meantime.
      
      In order to address this, we adopt a three stage strategy:
      - If we encounter a multi-source SGI in the AP list while computing
        its depth, we force the list to be sorted
      - When populating the LRs, we prevent the injection of any interrupt
        of lower priority than that of the first multi-source SGI we've
        injected.
      - Finally, the injection of a multi-source SGI triggers the request
        of a maintenance interrupt when there will be no pending interrupt
        in the LRs (HCR_NPIE).
      
      At the point where the last pending interrupt in the LRs switches
      from Pending to Active, the maintenance interrupt will be delivered,
      allowing us to add the remaining SGIs using the same process.
      
      Cc: stable@vger.kernel.org
      Fixes: 0919e84c ("KVM: arm/arm64: vgic-new: Add IRQ sync/flush framework")
      Acked-by: Christoffer Dall <cdall@kernel.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      16ca6a60
    • net: use skb_to_full_sk() in skb_update_prio() · 4dcb31d4
      Committed by Eric Dumazet
      Andrei Vagin reported a KASAN slab-out-of-bounds error in
      skb_update_prio().
      
      Since a SYNACK might be attached to a request socket, we need to
      get back to the listener socket.
      Since this listener is manipulated without locks, add const
      qualifiers to sock_cgroup_prioidx() so that the const sock pointer
      can also be used in skb_update_prio().
      
      Also add the const qualifier to sock_cgroup_classid() for consistency.
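      
      A hedged sketch of the resulting logic in skb_update_prio()
      (simplified; 'map' stands for the device's netprio map and is shown
      only for context):
      
        /* Sketch: a SYNACK skb may carry a request socket; map it back
         * to the full listener socket before reading cgroup data.
         */
        const struct sock *sk = skb_to_full_sk(skb);
      
        if (sk && sk_fullsock(sk)) {
                unsigned int prioidx = sock_cgroup_prioidx(&sk->sk_cgrp_data);
      
                if (prioidx < map->priomap_len)
                        skb->priority = map->priomap[prioidx];
        }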
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4dcb31d4
  8. 14 March 2018, 3 commits
  9. 12 March 2018, 3 commits
    • sock_diag: request _diag module only when the family or proto has been registered · bf2ae2e4
      Committed by Xin Long
      Currently, when using 'ss' from iproute, the kernel tries to load all
      _diag modules, which also causes the corresponding family and proto
      modules to be loaded due to module dependencies.
      
      After running 'ss', for instance, sctp, dccp and af_packet (if built
      as a module) would be loaded.
      
      For example:
      
        $ lsmod|grep sctp
        $ ss
        $ lsmod|grep sctp
        sctp_diag              16384  0
        sctp                  323584  5 sctp_diag
        inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
      
      As these family and proto modules are loaded unintentionally, it
      could cause some problems, like:
      
      - Some debug tools use 'ss' to collect the socket info, which loads all
        those diag and family and protocol modules. It's noisy for identifying
        issues.
      
      - Users usually expect sctp init packets to be dropped silently when
        they do not use the sctp protocol, rather than an abort being sent back.
      
      - It wastes resources (especially with multiple netns), and SCTP module
        can't be unloaded once it's loaded.
      
      ...
      
      In short, it's really inappropriate to have these family and proto
      modules loaded unexpectedly when just doing debugging with inet_diag.
      
      This patch introduces sock_load_diag_module(), which loads the
      _diag module only when its corresponding family or proto has
      already been registered.
      
      Note that we can't just load a _diag module without the family or
      proto loaded, as some symbols used in the _diag module come from the
      family or proto module.
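      
      A hedged sketch of the intended logic (simplified; the per-proto
      check inet_protocol_registered() and the module alias format below
      are illustrative stand-ins, not the exact code):
      
        /* Sketch: only request a _diag module whose family/proto is
         * already registered, so 'ss' does not pull in sctp/dccp/etc.
         */
        void sock_load_diag_module(int family, int protocol)
        {
                if (!protocol) {
                        if (!sock_is_registered(family))
                                return;
                } else if (!inet_protocol_registered(family, protocol)) {
                        return;
                }
      
                request_module("net-pf-%d-proto-%d-type-%d", PF_NETLINK,
                               NETLINK_SOCK_DIAG, family);
        }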
      
      v1->v2:
        - move inet proto check to inet_diag to avoid a compiling err.
      v2->v3:
        - define sock_load_diag_module in sock.c and export one symbol
          only.
        - improve the changelog.
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Phil Sutter <phil@nwl.cc>
      Acked-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf2ae2e4
    • net: phy: Tell caller result of phy_change() · a2c054a8
      Committed by Brad Mouring
      In 664fcf12 (net: phy: Threaded interrupts allow some simplification)
      the phy_interrupt system was changed to use a traditional threaded
      interrupt scheme instead of a workqueue approach.
      
      With this change, the phy status check moved into phy_change, which
      did not report back to the caller whether or not the interrupt was
      handled. This means that, in the case of a shared phy interrupt,
      only the first phydev's interrupt registers are checked (since
      phy_interrupt() would always return IRQ_HANDLED). This leads to
      interrupt storms when it is a secondary device that's actually the
      interrupt source.
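      
      A hedged sketch of the resulting flow (simplified from the idea
      described above, not the verbatim patch):
      
        /* Sketch: let phy_change() tell the core whether this phydev
         * really asserted the (possibly shared) interrupt line.
         */
        static irqreturn_t phy_interrupt(int irq, void *phy_dat)
        {
                struct phy_device *phydev = phy_dat;
      
                if (phydev->state == PHY_HALTED)
                        return IRQ_NONE;
      
                return phy_change(phydev);      /* IRQ_HANDLED or IRQ_NONE */
        }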
      Signed-off-by: Brad Mouring <brad.mouring@ni.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2c054a8
    • netfilter: x_tables: add and use xt_check_proc_name · b1d0a5d0
      Committed by Florian Westphal
      recent and hashlimit both create /proc files, but only check that the
      name is 0-terminated.
      
      This can trigger a WARN() from procfs when the name is "" or "/".
      Add a helper for this and then use it in both.
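      
      A hedged sketch of such a helper (the exact checks and error codes in
      the patch may differ):
      
        /* Sketch: reject proc entry names that procfs would WARN() about. */
        int xt_check_proc_name(const char *name, unsigned int size)
        {
                if (name[0] == '\0')
                        return -EINVAL;
      
                if (strnlen(name, size) == size)
                        return -ENAMETOOLONG;
      
                if (strchr(name, '/') ||
                    strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
                        return -EINVAL;
      
                return 0;
        }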
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: <syzbot+0502b00edac2a0680b61@syzkaller.appspotmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b1d0a5d0
  10. 10 March 2018, 3 commits
    • net: introduce IFF_NO_RX_HANDLER · f5426250
      Committed by Paolo Abeni
      Some network devices - notably ipvlan slave - are not compatible with
      any kind of rx_handler. Currently the hook can be installed but any
      configuration (bridge, bond, macsec, ...) is nonfunctional.
      
      This change allocates a priv_flags bit to mark such devices and
      explicitly forbids installing an rx_handler if that bit is set. The new
      bit is used by the ipvlan slave device.
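      
      A hedged sketch of the enforcement (simplified):
      
        /* Sketch: refuse to install an rx_handler on devices that opt out. */
        int netdev_rx_handler_register(struct net_device *dev,
                                       rx_handler_func_t *rx_handler,
                                       void *rx_handler_data)
        {
                if (netdev_is_rx_handler_busy(dev))
                        return -EBUSY;
      
                if (dev->priv_flags & IFF_NO_RX_HANDLER)
                        return -EINVAL;
      
                /* ... existing registration ... */
                return 0;
        }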
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f5426250
    • vhost_net: examine pointer types during un-producing · 3a403076
      Committed by Jason Wang
      After commit fc72d1d5 ("tuntap: XDP transmission"), we may
      actually be queueing XDP pointers in the pointer ring, so we should
      examine the pointer type before freeing the pointer.
      
      Fixes: fc72d1d5 ("tuntap: XDP transmission")
      Reported-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a403076
    • net: do not create fallback tunnels for non-default namespaces · 79134e6c
      Committed by Eric Dumazet
      fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
      ip6tnl0, ip6gre0) are automatically created when the corresponding
      module is loaded.
      
      These tunnels are also automatically created when a new network
      namespace is created, at a great cost.
      
      In many cases, netns are used for isolation purposes, and these
      extra network devices are a waste of resources. We are using
      thousands of netns per host, and hit the netns creation/delete
      bottleneck a lot. (Many thanks to Kirill for recent work on this)
      
      Add a new sysctl so that we can opt-out from this automatic creation.
      
      Note that these tunnels are still created for the initial namespace,
      to be the least intrusive for typical setups.
      
      Tested:
      lpk43:~# cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
      done
      wait
      
      lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m37.521s
      user	0m0.886s
      sys	7m7.084s
      lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m4.761s
      user	0m0.851s
      sys	1m8.343s
      lpk43:~#
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79134e6c
  11. 09 March 2018, 1 commit
    • net: ethtool: extend RXNFC API to support RSS spreading of filter matches · 84a1d9c4
      Committed by Edward Cree
      We use a two-step process to configure a filter with RSS spreading.  First,
       the RSS context is allocated and configured using ETHTOOL_SRSSH; this
       returns an identifier (rss_context) which can then be passed to subsequent
       invocations of ETHTOOL_SRXCLSRLINS to specify that the offset from the RSS
       indirection table lookup should be added to the queue number (ring_cookie)
       when delivering the packet.  Drivers for devices which can only use the
       indirection table entry directly (not add it to a base queue number)
       should reject rule insertions combining RSS with a nonzero ring_cookie.
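      
      A hedged userspace sketch of that two-step flow (field names reflect
      this extended API as understood here; constants, flags and setup
      details should be treated as illustrative):
      
        /* struct ifreq ifr and the ethtool socket sock_fd are assumed to
         * be set up beforehand for the target interface.
         */
      
        /* Step 1: allocate an RSS context; the kernel fills in the id. */
        struct ethtool_rxfh rssh = {
                .cmd         = ETHTOOL_SRSSH,
                .rss_context = ETH_RXFH_CONTEXT_ALLOC,
                /* indirection table / hash key setup omitted */
        };
        ifr.ifr_data = (void *)&rssh;
        ioctl(sock_fd, SIOCETHTOOL, &ifr);
      
        /* Step 2: insert a rule that spreads matches over that context. */
        struct ethtool_rxnfc nfc = {
                .cmd            = ETHTOOL_SRXCLSRLINS,
                .rss_context    = rssh.rss_context,      /* id from step 1 */
                .fs.flow_type   = TCP_V4_FLOW | FLOW_RSS,
                .fs.ring_cookie = 0,                     /* base queue */
                /* match fields omitted */
        };
        ifr.ifr_data = (void *)&nfc;
        ioctl(sock_fd, SIOCETHTOOL, &ifr);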
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84a1d9c4
  12. 08 March 2018, 10 commits
    • net/mlx5: IPSec, Add support for ESN · cb010083
      Committed by Aviad Yehezkel
      Currently ESN is not supported with IPSec device offload.
      
      This patch adds ESN support to IPsec device offload.
      It implements a new xfrm device operation to synchronize the
      offloading device's ESN with the xfrm received SN, and a new QP
      command to update the SA state as follows:
      
                 ESN 1                    ESN 2                  ESN 3
      |-----------*-----------|-----------*-----------|-----------*
      ^           ^           ^           ^           ^           ^
      
      ^ - marks where QP command invoked to update the SA ESN state
          machine.
      | - marks the start of the ESN scope (0-2^32-1). At this point move SA
          ESN overlap bit to zero and increment ESN.
      * - marks the middle of the ESN scope (2^31). At this point move SA
          ESN overlap bit to one.
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Yossef Efraim <yossefe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      cb010083
    • net/mlx5: Add flow-steering commands for FPGA IPSec implementation · 05564d0a
      Committed by Aviad Yehezkel
      In order to add a context to the FPGA, we need to get both the software
      transform context (which includes the keys, etc) and the
      source/destination IPs (which are included in the steering
      rule). Therefore, we register new set of firmware like commands for
      the FPGA. Each time a rule is added, the steering core infrastructure
      calls the FPGA command layer. If the rule is intended for the FPGA,
      it combines the IPs information with the software transformation
      context and creates the respective hardware transform.
      Afterwards, it calls the standard steering command layer.
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      05564d0a
    • net/mlx5: Refactor accel IPSec code · d6c4f029
      Committed by Aviad Yehezkel
      The current code has one layer that executes FPGA commands, and
      the Ethernet part uses this code directly. Since downstream patches
      introduce support for IPSec in mlx5_ib, we need to provide some
      abstractions. This patch refactors the accel code into one layer
      that creates a software IPSec transformation and another one which
      creates the actual hardware context.
      The internal command implementation is now hidden in the FPGA
      core layer. The code also adds the ability to share FPGA hardware
      contexts. If two contexts are the same, only a reference count
      is taken.
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      d6c4f029
    • net/mlx5: Added required metadata capability for ipsec · af9fe19d
      Committed by Aviad Yehezkel
      Currently our device requires additional metadata in the packet
      to perform ipsec crypto offload.
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      af9fe19d
    • net/mlx5: Export ipsec capabilities · 1d2005e2
      Committed by Aviad Yehezkel
      We will need that for ipsec verbs.
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      1d2005e2
    • net/mlx5: IPSec, Add command V2 support · 65802f48
      Committed by Aviad Yehezkel
      This patch adds V2 command support.
      New FPGA devices support extended features (UDP encap, ESN, etc.); these
      features require a new hardware SADB format, therefore we have a new
      version of commands to manipulate it.
      Signed-off-by: Yossef Efraim <yossefe@mellanox.com>
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      65802f48
    • net/mlx5e: IPSec, Add support for ESP trailer removal by hardware · 788a8210
      Committed by Yossi Kuperman
      Current hardware decrypts and authenticates incoming ESP packets.
      Subsequently, the software extracts the nexthdr field, truncates the
      trailer and adjusts csum accordingly.
      
      With this patch and a capable device, the trailer is removed
      by the hardware and the nexthdr field is conveyed via PET. This way
      we avoid both the need to access the trailer (cache miss) and to
      compute its relative checksum, which significantly improves
      performance.
      
      Experiments show that trailer removal improves performance by
      2Gbps (netperf), in both forwarding and host-to-host configurations.
      Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      788a8210
    • net/mlx5: IPSec, Generalize sandbox QP commands · 581fddde
      Committed by Yossi Kuperman
      The current code assumes only SA QP commands.
      Refactor in order to pave the way for new QP commands:
      1. Generic cmd response format.
      2. SA cmd checks are in dedicated functions.
      3. Aligned debug prints.
      Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      581fddde
    • net: unpollute priv_flags space · 1ec54cb4
      Committed by Paolo Abeni
      The ipvlan device driver defines and uses 2 bits inside the priv_flags
      net_device field. Such bits and the related helpers are used only
      inside the ipvlan device driver, and core networking does not
      need to be aware of them.
      
      This change moves the netif_is_ipvlan* helpers into the ipvlan driver
      and re-implements them by looking for ipvlan-specific symbols instead
      of using priv_flags.
      
      Overall this frees two bits inside priv_flags - and moves the following
      ones to avoid gaps - without any intended functional change.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ec54cb4
    • net: usbnet: fix potential deadlock on 32bit hosts · 2695578b
      Committed by Eric Dumazet
      Marek reported a LOCKDEP issue occurring on a 32bit host,
      which we tracked down to the fact that usbnet stats updates could
      run from either soft or hard irqs.
      
      This patch adds u64_stats_update_begin_irqsave() and
      u64_stats_update_end_irqrestore() helpers to solve this case.
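      
      A hedged sketch of the new helpers in use (the stats structure shown
      is illustrative of the usbnet per-cpu stats, not the exact driver
      code):
      
        /* Sketch: make the u64 stats seqcount safe against both hard
         * and soft irq updaters on 32bit hosts.
         */
        struct pcpu_sw_netstats *stats = this_cpu_ptr(dev->stats64);
        unsigned long flags;
      
        flags = u64_stats_update_begin_irqsave(&stats->syncp);
        stats->rx_packets++;
        stats->rx_bytes += skb->len;
        u64_stats_update_end_irqrestore(&stats->syncp, flags);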
      
      [   17.768040] ================================
      [   17.772239] WARNING: inconsistent lock state
      [   17.776511] 4.16.0-rc3-next-20180227-00007-g876c53a7493c #453 Not tainted
      [   17.783329] --------------------------------
      [   17.787580] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
      [   17.793607] swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      [   17.798751]  (&syncp->seq#5){?.-.}, at: [<9b22e5f0>]
      asix_rx_fixup_internal+0x188/0x288
      [   17.806790] {IN-HARDIRQ-W} state was registered at:
      [   17.811677]   tx_complete+0x100/0x208
      [   17.815319]   __usb_hcd_giveback_urb+0x60/0xf0
      [   17.819770]   xhci_giveback_urb_in_irq+0xa8/0x240
      [   17.824469]   xhci_td_cleanup+0xf4/0x16c
      [   17.828367]   xhci_irq+0xe74/0x2240
      [   17.831827]   usb_hcd_irq+0x24/0x38
      [   17.835343]   __handle_irq_event_percpu+0x98/0x510
      [   17.840111]   handle_irq_event_percpu+0x1c/0x58
      [   17.844623]   handle_irq_event+0x38/0x5c
      [   17.848519]   handle_fasteoi_irq+0xa4/0x138
      [   17.852681]   generic_handle_irq+0x18/0x28
      [   17.856760]   __handle_domain_irq+0x6c/0xe4
      [   17.860941]   gic_handle_irq+0x54/0xa0
      [   17.864666]   __irq_svc+0x70/0xb0
      [   17.867964]   arch_cpu_idle+0x20/0x3c
      [   17.871578]   arch_cpu_idle+0x20/0x3c
      [   17.875190]   do_idle+0x144/0x218
      [   17.878468]   cpu_startup_entry+0x18/0x1c
      [   17.882454]   start_kernel+0x394/0x400
      [   17.886177] irq event stamp: 161912
      [   17.889616] hardirqs last  enabled at (161912): [<7bedfacf>]
      __netdev_alloc_skb+0xcc/0x140
      [   17.897893] hardirqs last disabled at (161911): [<d58261d0>]
      __netdev_alloc_skb+0x94/0x140
      [   17.904903] exynos5-hsi2c 12ca0000.i2c: tx timeout
      [   17.906116] softirqs last  enabled at (161904): [<387102ff>]
      irq_enter+0x78/0x80
      [   17.906123] softirqs last disabled at (161905): [<cf4c628e>]
      irq_exit+0x134/0x158
      [   17.925722].
      [   17.925722] other info that might help us debug this:
      [   17.933435]  Possible unsafe locking scenario:
      [   17.933435].
      [   17.940331]        CPU0
      [   17.942488]        ----
      [   17.944894]   lock(&syncp->seq#5);
      [   17.948274]   <Interrupt>
      [   17.950847]     lock(&syncp->seq#5);
      [   17.954386].
      [   17.954386]  *** DEADLOCK ***
      [   17.954386].
      [   17.962422] no locks held by swapper/0/0.
      
      Fixes: c8b5d129 ("net: usbnet: support 64bit stats")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2695578b
  13. 07 March 2018, 5 commits
  14. 06 March 2018, 2 commits
  15. 05 March 2018, 1 commit