1. 21 Oct, 2018 1 commit
    •
      bpf: sk_msg program helper bpf_msg_push_data · 6fff607e
      John Fastabend committed
      This allows the user to push data into a msg using sk_msg program types.
      The format is as follows:
      
      	bpf_msg_push_data(msg, offset, len, flags)
      
      This will insert 'len' bytes at offset 'offset'. For example, to
      prepend 10 bytes at the front of the message the user can call:
      
      	bpf_msg_push_data(msg, 0, 10, 0);
      
      This will invalidate the data bounds, so the BPF user will have to recheck
      them after calling this. After the call the msg size will have been
      updated and the user is free to write into the added bytes. We allow
      any offset/len as long as it is within the (data, data_end) range.
      However, a copy will be required if the ring is full, and it is possible
      for the helper to fail with ENOMEM or EINVAL errors, which need to be
      handled by the BPF program.
      
      This can be used similarly to XDP metadata to pass data between the
      sk_msg layer and lower layers.
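
The insert-at-offset behaviour described above can be illustrated with a small userspace model. This is only a sketch and not the kernel implementation: `struct toy_msg`, `TOY_MSG_CAP` and `toy_msg_push_data()` are hypothetical names, the flat fixed-size buffer stands in for the scatter-gather ring, and zero-filling the new bytes is a choice of this model rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define TOY_MSG_CAP 64 /* hypothetical fixed capacity of the model buffer */

/* Toy stand-in for a message; NOT the kernel's struct sk_msg. */
struct toy_msg {
	uint8_t data[TOY_MSG_CAP];
	uint32_t size; /* bytes currently held in the message */
};

/*
 * Open a gap of 'len' bytes at 'offset', shifting the tail of the
 * message towards the end of the buffer.
 * Returns 0 on success, -EINVAL if 'offset' lies beyond the message,
 * and -ENOMEM if the model buffer cannot hold the grown message.
 */
static int toy_msg_push_data(struct toy_msg *msg, uint32_t offset, uint32_t len)
{
	if (offset > msg->size)
		return -EINVAL;
	if (len > TOY_MSG_CAP - msg->size)
		return -ENOMEM;

	/* Move everything from 'offset' onwards up by 'len' bytes. */
	memmove(msg->data + offset + len, msg->data + offset, msg->size - offset);
	/* Zero the new bytes (model choice); the caller may now write into them. */
	memset(msg->data + offset, 0, len);
	msg->size += len;
	return 0;
}
```

In this model, calling `toy_msg_push_data(&m, 0, 10)` on a 5-byte message yields a 15-byte message whose original payload starts at byte 10, and any previously computed bounds are stale, which mirrors why a BPF program has to recheck data bounds after the real helper.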
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 20 Oct, 2018 6 commits
    •
      bpf: skmsg, fix psock create on existing kcm/tls port · 5032d079
      John Fastabend committed
      Before using the psock returned by sk_psock_get() when adding it to a
      sockmap we need to ensure it is actually a sockmap based psock.
      Previously we were only checking this after incrementing the reference
      counter which was an error. This resulted in a slab-out-of-bounds
      error when the psock was not actually a sockmap type.
      
      This moves the check up so the reference counter is only used
      if it is a sockmap psock.
      
      Eric reported the following KASAN BUG,
      
      BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
      Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
      
      CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
       sk_psock_get include/linux/skmsg.h:379 [inline]
       sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
       sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
       sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
       map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    •
      bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB · b39b5f41
      Song Liu committed
      BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
      skb. This patch enables direct access of skb for these programs.
      
      Two helper functions, bpf_compute_and_save_data_end() and
      bpf_restore_data_end(), are introduced. They are used in
      __cgroup_bpf_run_filter_skb() to compute the proper data_end for the
      BPF program and to restore the original data afterwards.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall · bd513cd0
      Mauricio Vasquez B committed
      The previous patch implemented bpf queue/stack maps that
      provide the peek/pop/push functions.  There is no direct
      relationship between those functions and the current map
      syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added.
      It is mapped to the pop operation in the queue/stack maps
      and is still to be implemented for other kinds of maps.
      Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: add queue and stack maps · f1a2e44a
      Mauricio Vasquez B committed
      Queue/stack maps implement a FIFO/LIFO data storage for eBPF programs.
      These maps support peek, pop and push operations that are exposed to eBPF
      programs through the new bpf_map[peek/pop/push] helpers.  Those operations
      are exposed to userspace applications through the already existing
      syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM            -> peek
      BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
      BPF_MAP_UPDATE_ELEM            -> push
      
      Queue/stack maps are implemented using a buffer, tail and head indexes,
      hence BPF_F_NO_PREALLOC is not supported.
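
The buffer-plus-head/tail layout and the peek/pop/push operations described above can be sketched in plain userspace C. This is an illustrative model under stated assumptions, not the kernel's map code: all `toy_*` names and `TOY_SLOTS` are hypothetical, the element type is fixed to `uint32_t`, and the error codes returned for "full" and "empty" are choices made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SLOTS 4 /* hypothetical size; one slot stays unused, so 3 values fit */

/* One preallocated buffer shared by the FIFO and LIFO views. */
struct toy_qs {
	uint32_t buf[TOY_SLOTS];
	uint32_t head; /* index of the next slot to write */
	uint32_t tail; /* index of the oldest stored value */
};

static bool toy_empty(const struct toy_qs *q)
{
	return q->head == q->tail;
}

static bool toy_full(const struct toy_qs *q)
{
	return (q->head + 1) % TOY_SLOTS == q->tail;
}

/* push: store a value at head (same operation for queue and stack). */
static int toy_push(struct toy_qs *q, uint32_t value)
{
	if (toy_full(q))
		return -E2BIG; /* error code is an assumption of this model */
	q->buf[q->head] = value;
	q->head = (q->head + 1) % TOY_SLOTS;
	return 0;
}

/* Queue (FIFO) view: read the oldest value; remove=false is peek, true is pop. */
static int toy_queue_get(struct toy_qs *q, uint32_t *out, bool remove)
{
	if (toy_empty(q))
		return -ENOENT;
	*out = q->buf[q->tail];
	if (remove)
		q->tail = (q->tail + 1) % TOY_SLOTS;
	return 0;
}

/* Stack (LIFO) view: read the newest value; remove=false is peek, true is pop. */
static int toy_stack_get(struct toy_qs *q, uint32_t *out, bool remove)
{
	uint32_t idx;

	if (toy_empty(q))
		return -ENOENT;
	idx = (q->head + TOY_SLOTS - 1) % TOY_SLOTS;
	*out = q->buf[idx];
	if (remove)
		q->head = idx;
	return 0;
}
```

Because the storage is a single fixed array indexed by head and tail, every slot has to exist up front, which is consistent with the commit's note that BPF_F_NO_PREALLOC is not supported.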
      
      As opposed to other maps, queue and stack maps do not use RCU for protecting
      map values. The bpf_map[peek/pop] helpers have an ARG_PTR_TO_UNINIT_MAP_VALUE
      argument that is a pointer to a memory zone where the value of a
      map is saved.  It is basically the same as ARG_PTR_TO_UNINIT_MEM, but the size
      does not have to be passed as an extra argument.
      
      Our main motivation for implementing queue/stack maps was to keep track
      of a pool of elements, like network ports in a SNAT; however, we foresee
      other use cases, for example saving the last N kernel events in a map
      and then analysing them from userspace.
      Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE · 2ea864c5
      Mauricio Vasquez B committed
      The ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
      used to save the value of a map.  It is basically the same as
      ARG_PTR_TO_UNINIT_MEM, but the size does not have to be passed as an extra
      argument.
      
      This will be used in the following patch that implements some new
      helpers that receive a pointer to be filled with a map value.
      Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf: rename stack trace map operations · 14499160
      Mauricio Vasquez B committed
      In the following patches queue and stack maps (FIFO and LIFO
      data structures) will be implemented.  In order to avoid confusion and
      a possible name clash, rename stack_map_ops to stack_trace_map_ops.
      Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 18 Oct, 2018 1 commit
  4. 17 Oct, 2018 3 commits
  5. 16 Oct, 2018 17 commits
    •
      net: Enable kernel side filtering of route dumps · effe6792
      David Ahern committed
      Update parsing of route dump request to enable kernel side filtering.
      Allow filtering results by protocol (e.g., which routing daemon installed
      the route), route type (e.g., unicast), table id and nexthop device. These
      amount to the low hanging fruit, yet a huge improvement, for dumping
      routes.
      
      ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
      be used to look up the device index without taking a reference. From
      there filter->dev is only used during dump loops with the lock still held.
      
      Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
      have been filtered should no entries be returned.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: Plumb support for filtering ipv4 and ipv6 multicast route dumps · cb167893
      David Ahern committed
      Implement kernel side filtering of routes by egress device index and
      table id. If the table id is given in the filter, look up the table and
      call mr_table_dump directly for it.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      ipmr: Refactor mr_rtm_dumproute · e1cedae1
      David Ahern committed
      Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
      mr_table_dump for dumps by specific table id.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net/ipv4: Plumb support for filtering route dumps · 18a8021a
      David Ahern committed
      Implement kernel side filtering of routes by table id, egress device index,
      protocol and route type. If the table id is given in the filter, look up the
      table and call fib_table_dump directly for it.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: Add struct for fib dump filter · 4724676d
      David Ahern committed
      Add struct fib_dump_filter for options on limiting which routes are
      returned in a dump request. The current list is table id, protocol,
      route type, rtm_flags and nexthop device index. struct net is needed
      to look up the net_device from the index.
      
      Declare the filter for each route dump handler and plumb the new
      arguments from dump handlers to ip_valid_fib_dump_req.
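
How such a filter narrows a dump can be sketched as a simple predicate in userspace C. The structures below are hypothetical mirrors of the options listed above, not the kernel's `struct fib_dump_filter`; treating a zero field as "not set" and matching flags as "all requested bits present" are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filter: a zero field means "do not filter on this". */
struct toy_dump_filter {
	uint32_t table_id;
	uint8_t protocol;
	uint8_t rt_type;
	uint32_t flags;
	int dev_ifindex; /* nexthop device index */
};

/* Hypothetical route record carrying the attributes a filter can test. */
struct toy_route {
	uint32_t table_id;
	uint8_t protocol;
	uint8_t rt_type;
	uint32_t flags;
	int dev_ifindex;
};

/* Return true if the route passes every option that is set in the filter. */
static bool toy_route_matches(const struct toy_route *rt,
			      const struct toy_dump_filter *f)
{
	if (f->table_id && rt->table_id != f->table_id)
		return false;
	if (f->protocol && rt->protocol != f->protocol)
		return false;
	if (f->rt_type && rt->rt_type != f->rt_type)
		return false;
	if (f->flags && (rt->flags & f->flags) != f->flags)
		return false;
	if (f->dev_ifindex && rt->dev_ifindex != f->dev_ifindex)
		return false;
	return true;
}
```

A dump loop would call such a predicate once per route and skip the entries that fail it, so only matching routes are copied to the requester.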
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      netlink: Add answer_flags to netlink_callback · 22e6c58b
      David Ahern committed
      With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
      flag is set on a message back to the user if the data returned is
      influenced by some input attributes. Normally this can be done as
      messages are added to the skb, but if the filter results in no data
      being returned, the user could be confused as to why.
      
      This patch adds answer_flags to the netlink_callback allowing dump
      handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
      NLMSG_DONE message ensuring the flag gets back to the user.
      
      The netlink_callback space is initialized to 0 via a memset in
      __netlink_dump_start, so init of the new answer_flags is covered.
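
The idea of carrying the flag on the final message can be modelled in a few lines of userspace C. This is a sketch only: `struct toy_cb`, `struct toy_hdr` and `toy_make_done()` are hypothetical stand-ins, and the `TOY_*` constants are local copies of the values that `NLMSG_DONE`, `NLM_F_MULTI` and `NLM_F_DUMP_FILTERED` have in the netlink uapi header.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of uapi netlink values, redefined so the sketch is self-contained. */
#define TOY_NLMSG_DONE          0x3
#define TOY_NLM_F_MULTI         0x02
#define TOY_NLM_F_DUMP_FILTERED 0x20

/* Hypothetical per-dump state; zero-initialised like the real callback space. */
struct toy_cb {
	uint16_t answer_flags;
};

/* Hypothetical message header holding only the fields used here. */
struct toy_hdr {
	uint16_t type;
	uint16_t flags;
};

/* Build the terminating message of a dump, folding in the handler's answer flags. */
static struct toy_hdr toy_make_done(const struct toy_cb *cb)
{
	struct toy_hdr hdr = {
		.type = TOY_NLMSG_DONE,
		.flags = (uint16_t)(TOY_NLM_F_MULTI | cb->answer_flags),
	};
	return hdr;
}
```

Even when a filter matches nothing and no data messages are produced, the terminating message built this way still tells the requester that the (empty) result was filtered.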
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Eric Dumazet committed
      sk_pacing_rate was introduced as a u32 field in 2013,
      effectively limiting per-flow pacing to 34Gbit.
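
The 34Gbit figure follows from simple arithmetic, assuming the field counts bytes per second (the unit is not stated in this commit message). A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest rate, in bits per second, that a 32-bit field can express
 * when it stores bytes per second (assumed unit).
 */
static uint64_t max_u32_pacing_bits_per_sec(void)
{
	return (uint64_t)UINT32_MAX * 8; /* 4294967295 bytes/s * 8 bits/byte */
}
```

The result is 34359738360 bit/s, i.e. roughly 34.36 Gbit/s, which is below the 100Gbit single-flow rates mentioned in the commit.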
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate were already
      exported as 64bit, so the iproute2/ss command requires no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh · 5f6188a8
      Eric Dumazet committed
      In EDT design, I made the mistake of using tcp_wstamp_ns
      to store the last tcp_clock_ns() sample and to store the
      pacing virtual timer.
      
      This causes major regressions at high speed flows.
      
      Introduce tcp_clock_cache to store last tcp_clock_ns().
      This is needed because some arches have slow high-resolution
      kernel time service.
      
      tcp_wstamp_ns is only updated when a packet is sent.
      
      Note that we can remove tcp_mstamp in the future since
      tcp_mstamp is essentially tcp_clock_cache/1000, so the
      apparent socket size increase is temporary.
      
      Fixes: 9799ccb0 ("tcp: add tcp_wstamp_ns socket field")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command · 9771b8cc
      Justin.Lee1@Dell.com committed
      The new command (NCSI_CMD_SEND_CMD) is added to allow a user space application
      to send NC-SI commands to the network card.
      Also, add a new attribute (NCSI_ATTR_DATA) for transferring the request and response.
      
      The work flow is as below.
      
      Request:
      User space application
      	-> Netlink interface (msg)
      	-> new Netlink handler - ncsi_send_cmd_nl()
      	-> ncsi_xmit_cmd()
      
      Response:
      Response received - ncsi_rcv_rsp()
      	-> internal response handler - ncsi_rsp_handler_xxx()
      	-> ncsi_rsp_handler_netlink()
      	-> ncsi_send_netlink_rsp ()
      	-> Netlink interface (msg)
      	-> user space application
      
      Command timeout - ncsi_request_timeout()
      	-> ncsi_send_netlink_timeout ()
      	-> Netlink interface (msg with zero data length)
      	-> user space application
      
      Error:
      Error detected
      	-> ncsi_send_netlink_err ()
      	-> Netlink interface (err msg)
      	-> user space application
      Signed-off-by: Justin Lee <justin.lee1@dell.com>
      Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      FDDI: defza: Support capturing outgoing SMT traffic · 9f9a742d
      Maciej W. Rozycki committed
      DEC FDDIcontroller 700 (DEFZA) uses a Tx/Rx queue pair to communicate
      SMT frames with adapter's firmware.  Any SMT frame received from the RMC
      via the Rx queue is queued back by the driver to the SMT Rx queue for
      the firmware to process.  Similarly the firmware uses the SMT Tx queue
      to supply the driver with SMT frames which are queued back to the Tx
      queue for the RMC to send to the ring.
      
      When a network tap is attached to an FDDI interface handled by `defza'
      any incoming SMT frames captured are queued to our usual processing of
      network data received, which in turn delivers them to any listening
      taps.
      
      However the outgoing SMT frames produced by the firmware bypass our
      network protocol stack and are therefore not delivered to taps.  This in
      turn means that taps are missing a part of network traffic sent by the
      adapter, which may make it more difficult to track down network problems
      or do general traffic analysis.
      
      Call `dev_queue_xmit_nit' then in the SMT Tx path, having checked that
      a network tap is attached, with a newly-created `dev_nit_active' helper
      wrapping the usual condition used in the transmit path.
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter · 61414f5e
      Maciej W. Rozycki committed
      Add support for the DEC FDDIcontroller 700 (DEFZA), Digital Equipment
      Corporation's first-generation FDDI network interface adapter, made for
      TURBOchannel and based on a discrete version of what eventually became
      Motorola's widely used CAMEL chipset.
      
      The CAMEL chipset is present for example in the DEC FDDIcontroller
      TURBOchannel, EISA and PCI adapters (DEFTA/DEFEA/DEFPA) that we support
      with the `defxx' driver, however the host bus interface logic and the
      firmware API are different in the DEFZA and hence a separate driver is
      required.
      
      There isn't much to say about the driver except that it works, but there
      is one peculiarity to mention.  The adapter implements two Tx/Rx queue
      pairs.
      
      Of these one pair is the usual network Tx/Rx queue pair, in this case
      used by the adapter to exchange frames with the ring, via the RMC (Ring
      Memory Controller) chip.  The Tx queue is handled directly by the RMC
      chip and resides in onboard packet memory.  The Rx queue is maintained
      via DMA in host memory by adapter's firmware copying received data
      stored by the RMC in onboard packet memory.
      
      The other pair is used to communicate SMT frames with adapter's
      firmware.  Any SMT frame received from the RMC via the Rx queue must be
      queued back by the driver to the SMT Rx queue for the firmware to
      process.  Similarly the firmware uses the SMT Tx queue to supply the
      driver with SMT frames that must be queued back to the Tx queue for the
      RMC to send to the ring.
      
      This solution was chosen because the designers ran out of PCB space and
      could not squeeze in more logic onto the board that would be required to
      handle this SMT frame traffic without the need to involve the driver, as
      with the later DEFTA/DEFEA/DEFPA adapters.
      
      Finally the driver does some Frame Control byte decoding, so to avoid
      magic numbers some macros are added to <linux/if_fddi.h>.
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      bpf: Allow sk_lookup with IPv6 module · 8a615c6b
      Joe Stringer committed
      This is a more complete fix than d71019b5 ("net: core: Fix build
      with CONFIG_IPV6=m"), so that IPv6 sockets may be looked up if the IPv6
      module is loaded (not just if it's compiled in).
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      tls: add bpf support to sk_msg handling · d3b18ad3
      John Fastabend committed
      This work adds BPF sk_msg verdict program support to kTLS
      allowing BPF and kTLS to be combined together. Previously kTLS
      and sk_msg verdict programs were mutually exclusive in the
      ULP layer which created challenges for the orchestrator when
      trying to apply TCP based policy, for example. To resolve this,
      leveraging the work from previous patches that consolidates
      the use of sk_msg, we can finally enable BPF sk_msg verdict
      programs so they continue to run after the kTLS socket is
      created. There is no change in behavior when kTLS is not used in
      combination with BPF, and the kselftest suite for kTLS also runs
      successfully.
      
      Joint work with Daniel.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      tls: replace poll implementation with read hook · 924ad65e
      John Fastabend committed
      Instead of re-implementing the poll routine and using the poll callback to
      trigger a read from kTLS, we reuse the stream_memory_read callback,
      which is simpler and achieves the same. This helps to align sockmap
      and kTLS so we can more easily embed BPF in kTLS.
      
      Joint work with Daniel.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      tls: convert to generic sk_msg interface · d829e9c4
      Daniel Borkmann committed
      Convert kTLS over to make use of the sk_msg interface for plaintext and
      encrypted scatter-gather data, so it reuses all the sk_msg helpers
      and data structures, which in a later second step makes it possible to glue
      this to BPF.
      
      This also allows removing quite a bit of open coded helpers which
      are covered by the sk_msg API. Recent changes in kTLS 80ece6a0
      ("tls: Remove redundant vars from tls record structure") and
      4e6d4720 ("tls: Add support for inplace records encryption")
      changed the data path handling a bit; while we've kept the latter
      optimization intact, we had to undo the former change to better
      fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
      been brought back and are linked into the sk_msg sgs. Now the kTLS
      record contains a msg_plaintext and msg_encrypted sk_msg each.
      
      In the original code, the zerocopy_from_iter() has been used out
      of TX but also RX path. For the strparser skb-based RX path,
      we've left the zerocopy_from_iter() in decrypt_internal() mostly
      untouched, meaning it has been moved into tls_setup_from_iter()
      with charging logic removed (as not used from RX). Given RX path
      is not based on sk_msg objects, we haven't pursued setting up a
      dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
      could be an option to pursue in a later step.
      
      Joint work with John.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      bpf, sockmap: convert to generic sk_msg interface · 604326b4
      Daniel Borkmann committed
      Add a generic sk_msg layer, and convert current sockmap and later
      kTLS over to make use of it. While sk_buff handles network packet
      representation from netdevice up to socket, sk_msg handles data
      representation from application to socket layer.
      
      This means that sk_msg framework spans across ULP users in the
      kernel, and enables features such as introspection or filtering
      of data with the help of BPF programs that operate on this data
      structure.
      
      The latter becomes particularly useful for kTLS, where data encryption
      is deferred into the kernel, enabling the kernel to
      perform L7 introspection and policy based on BPF for TLS connections
      where the record is encrypted after BPF has run and come to
      a verdict. In order to get there, the first step is to transform the open
      coding of scatter-gather list handling into a common core framework
      that subsystems can use.
      
      The code itself has been split and refactored into three bigger
      pieces: i) the generic sk_msg API which deals with managing the
      scatter gather ring, providing helpers for walking and mangling,
      transferring application data from user space into it, and preparing
      it for BPF pre/post-processing, ii) the plain sock map itself
      where sockets can be attached to or detached from; these bits
      are independent of i) which can now be used also without sock
      map, and iii) the integration with plain TCP as one protocol
      to be used for processing L7 application data (later this could
      e.g. also be extended to other protocols like UDP). The semantics
      are the same as with the old sock map code, so there is no change
      in user facing behavior or APIs. This work
      also helped find a number of bugs in the old sockmap code
      that we've already fixed in earlier commits. The test_sockmap
      kselftest suite passes fine as well.
      
      Joint work with John.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    •
      tcp, ulp: remove ulp bits from sockmap · 1243a51f
      Daniel Borkmann committed
      In order to prepare sockmap logic to be used in combination with kTLS
      we need to detangle it from ULP, and further split it in later commits
      into a generic API.
      
      Joint work with John.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  6. 14 Oct, 2018 1 commit
  7. 13 Oct, 2018 4 commits
    •
      netlink: replace __NLA_ENSURE implementation · 5886d932
      Johannes Berg committed
      We already have BUILD_BUG_ON_ZERO() which I just hadn't found
      before, so we should use it here instead of open-coding another
      implementation thereof.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: bridge: add support for per-port vlan stats · 9163a0fc
      Nikolay Aleksandrov committed
      This patch adds an option to have per-port vlan stats instead of the
      default global stats. The option can be set only when there are no port
      vlans in the bridge since we need to allocate the stats if it is set
      when vlans are being added to ports (and respectively free them
      when being deleted). Also bump RTNL_MAX_TYPE as the bridge is the
      largest user of options. The current stats design allows us to add
      these without any changes to the fast-path, it all comes down to
      the per-vlan stats pointer which, if this option is enabled, will
      be allocated for each port vlan instead of using the global bridge-wide
      one.
      
      CC: bridge@lists.linux-foundation.org
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: Evict neighbor entries on carrier down · 859bd2ef
      David Ahern committed
      When a link's carrier goes down it could be a sign of the port changing
      networks. If the new network has overlapping addresses with the old one,
      then the kernel will continue trying to use neighbor entries established
      based on the old network until the entries finally age out - meaning a
      potentially long delay with communications not working.
      
      This patch evicts neighbor entries on carrier down with the exception of
      those marked permanent. Permanent entries are managed by userspace (either
      an admin or a routing daemon such as FRR).
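
The eviction rule can be sketched as a userspace loop over a toy neighbor table. This is an illustration under stated assumptions, not the kernel's neighbor code: `struct toy_neigh` and `toy_evict_on_carrier_down()` are hypothetical, and `TOY_NUD_PERMANENT` is a local copy of the value of the uapi `NUD_PERMANENT` state bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUD_PERMANENT 0x80 /* local copy of the NUD_PERMANENT state bit */

/* Hypothetical neighbor entry reduced to what the rule needs. */
struct toy_neigh {
	int ifindex;   /* device the entry was learned on */
	uint8_t state; /* NUD_* style state bits */
	bool valid;    /* false once evicted */
};

/*
 * Evict every non-permanent entry that belongs to the device whose
 * carrier went down. Returns the number of entries evicted.
 */
static size_t toy_evict_on_carrier_down(struct toy_neigh *tbl, size_t n, int ifindex)
{
	size_t evicted = 0;

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].valid || tbl[i].ifindex != ifindex)
			continue;
		if (tbl[i].state & TOY_NUD_PERMANENT)
			continue; /* managed by userspace, left alone */
		tbl[i].valid = false;
		evicted++;
	}
	return evicted;
}
```

Entries on other devices and permanent entries survive the call, so only dynamically learned neighbors on the affected link have to be resolved again.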
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net/ipv6: Add knob to skip DELROUTE message on device down · 7c6bb7d2
      David Ahern committed
      Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE
      notifications when a device is taken down (admin down) or deleted. IPv4
      does not generate a message for routes evicted by the down or delete;
      IPv6 does. A NOS at scale really needs to avoid these messages and have
      IPv4 and IPv6 behave similarly, relying on userspace to handle link
      notifications and evict the routes.
      
      At this point existing user behavior needs to be preserved. Since
      notifications are a global action (not per app) the only way to preserve
      existing behavior and allow the messages to be skipped is to add a new
      sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to
      disable the notifications.
      
      IPv6 route code already supports the option to skip the message (it is
      used for multipath routes for example). Besides the new sysctl we need
      to pass the skip_notify setting through the generic fib6_clean and
      fib6_walk functions to fib6_clean_node and to set skip_notify on calls
      to __ip_del_rt for the addrconf_ifdown path.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 12 Oct, 2018 5 commits
  9. 11 Oct, 2018 2 commits