1. 31 Mar, 2018 2 commits
    • bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov committed
      == The problem ==
      
      There is a use case in which all processes inside a cgroup should use a
      single IP address on a host that has multiple IPs configured. Those
      processes should use that IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. This should not require changing application
      code, since that is often not possible.
      
      Currently this is solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. A shared library is preloaded for
      every process in the cgroup so that whenever a TCP/UDP server calls
      `bind(2)`, the library replaces the IP in sockaddr before passing the
      arguments to the syscall. When an application calls `connect(2)`, the
      library transparently binds the local end of the connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid a performance penalty).
      
      The shared-library approach is fragile, though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the first
      part of the problem: binding TCP/UDP servers to the desired IP. It does
      not depend on the application environment or on implementation details
      (whether glibc is used or not).
      
      It adds a new eBPF program type, `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`, and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to the already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and with a user-provided `struct sockaddr`. Pointers to both
      are part of the context passed to programs of the newly added type.
      
      The new attach types provide hooks in the `bind(2)` system call for both
      IPv4 and IPv6, so that one can write a program that overrides the IP
      addresses and ports a user program tries to bind to, and apply that
      program to a whole cgroup.
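
      As an illustration of what a program attached at `BPF_CGROUP_INET4_BIND`
      might do, here is a minimal userspace sketch in C of the address rewrite.
      `struct bind4_ctx` is a hypothetical stand-in for the program context, not
      the kernel definition; its `user_ip4` and `user_port` members are modeled
      on the `user_ip6[]`-style context fields this commit mentions, and
      `CGROUP_IP4` is an assumed example address. A real program would be built
      for the BPF target and attached to a cgroup, which this sketch does not do.

      ```c
      #include <stdint.h>
      #include <string.h>

      /* Hypothetical stand-in for the program context; not the kernel struct. */
      struct bind4_ctx {
          uint32_t user_ip4;   /* IPv4 address, network byte order */
          uint32_t user_port;  /* port as passed by the caller */
      };

      /* Assumed example address 192.0.2.10, stored as bytes in network order. */
      static const uint8_t CGROUP_IP4[4] = { 192, 0, 2, 10 };

      /*
       * Overwrite the address the caller asked to bind to and leave the port
       * untouched. Returning 1 is assumed here to mean "let bind(2) proceed".
       */
      int bind4_override(struct bind4_ctx *ctx)
      {
          uint32_t ip;

          memcpy(&ip, CGROUP_IP4, sizeof(ip));
          ctx->user_ip4 = ip;
          return 1;
      }
      ```

      Because the rewrite happens on the context rather than in the process,
      the application's own `bind(2)` call needs no changes.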
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for the corresponding socket family. E.g. if the user passes a
      `sockaddr_in`, it doesn't make sense to read from / write to the
      `user_ip6[]` context fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using a
      special field as an additional "register".
      
      Only two registers are available in `sock_addr_convert_ctx_access`: `src`,
      holding the value to write, and `dst`, holding the pointer to the context,
      which must not be changed so that later instructions are not broken. But
      the writable fields are not directly accessible: to reach them, the address
      of the corresponding pointer has to be loaded first. To obtain an
      additional register, the first register not used by `src` or `dst` is
      taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, the register
      is then used to load the address of the pointer field, and finally the
      register's content is restored from the temporary field after the `src`
      value has been written.
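
      The register selection described above can be sketched in plain C. This
      is an illustrative model only: registers are represented as small
      integers, `pick_scratch_reg` is a hypothetical helper name, and searching
      downward from the highest register is an assumption (the commit message
      only says the first register not used by `src` and `dst` is taken).

      ```c
      /*
       * Return the first register, counting down from `highest`, that is
       * neither `src` nor `dst`. At most two candidates can be excluded,
       * so the loop ends after at most two steps when highest >= 2.
       */
      int pick_scratch_reg(int src, int dst, int highest)
      {
          int reg = highest;

          while (reg == src || reg == dst)
              --reg;
          return reg;
      }
      ```

      The chosen register is then saved to the temporary field, used to load
      the pointer, and restored once the write is done.
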
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov committed
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. a context structure may have fields for both IPv4 and IPv6, but it
      doesn't make sense to read from / write to an IPv6 field when the attach
      point is somewhere in the IPv4 stack.
      
      The same applies to BPF helpers: it may make sense to call some helper
      from one attach point but not from another for the same prog type.
      
      == The solution ==
      
      Introduce an `expected_attach_type` field in `struct bpf_attr` for the
      `BPF_PROG_LOAD` command. If the scenario described in "The problem"
      section applies to some prog type, the field will be checked twice:
      
      1) At load time the prog type is checked to see whether the attach type
         must be known to validate program permissions correctly. If so, and
         `expected_attach_type` is not specified or has an invalid value, the
         prog will be rejected with EINVAL.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`
         if the prog type requires one, and, if they differ, the attach will
         be rejected with EINVAL.
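
      The two checks can be modeled in a few lines of C. This is a simplified
      sketch, not the kernel implementation: the enums are reduced, hypothetical
      versions of the real program and attach type lists, and `check_load`,
      `check_attach` and `needs_attach_type` are illustrative names.

      ```c
      #include <errno.h>
      #include <stdbool.h>

      /* Hypothetical, reduced enums; the kernel's lists are much longer. */
      enum prog_type { PROG_SOCKET_FILTER, PROG_CGROUP_SOCK_ADDR };
      enum attach_type { ATTACH_UNSPEC, ATTACH_INET4_BIND, ATTACH_INET6_BIND };

      struct prog {
          enum prog_type type;
          enum attach_type expected_attach_type;
      };

      /* Does this prog type need the attach type to validate permissions? */
      static bool needs_attach_type(enum prog_type t)
      {
          return t == PROG_CGROUP_SOCK_ADDR;
      }

      /* Check 1, at load time: the expected attach type must be valid. */
      int check_load(const struct prog *p)
      {
          if (!needs_attach_type(p->type))
              return 0;
          switch (p->expected_attach_type) {
          case ATTACH_INET4_BIND:
          case ATTACH_INET6_BIND:
              return 0;
          default:
              return -EINVAL;
          }
      }

      /* Check 2, at attach time: the attach point must match what was declared. */
      int check_attach(const struct prog *p, enum attach_type attach_type)
      {
          if (!needs_attach_type(p->type))
              return 0;
          return p->expected_attach_type == attach_type ? 0 : -EINVAL;
      }
      ```

      Prog types that do not need the extra information pass both checks
      unchanged, so only the new-style types are affected in this model.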
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` and can be used to check context
      accesses and calls to helpers, respectively.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 30 Mar, 2018 2 commits
  3. 29 Mar, 2018 3 commits
  4. 24 Mar, 2018 1 commit
  5. 23 Mar, 2018 13 commits
    • net: Replace ip_ra_lock with per-net mutex · d9ff3049
      Kirill Tkhai committed
      Since ra_chain is per-net, we may use per-net mutexes
      to protect them in ip_ra_control(). This improves
      scalability.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Make ip_ra_chain per struct net · 5796ef75
      Kirill Tkhai committed
      This is an optimization that makes ip_call_ra_chain()
      iterate over fewer sockets to find the ones it is looking for.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Revert "ipv4: fix a deadlock in ip_ra_control" · 128aaa98
      Kirill Tkhai committed
      This reverts commit 1215e51e.
      Since raw_close() is used on every RAW socket destruction,
      the changes made by 1215e51e scale badly. This is clearly
      seen on an endless unshare(CLONE_NEWNET) test, where the cleanup_net()
      kwork spends a lot of time waiting for the rtnl_lock() introduced
      by that commit.
      
      The previous patch moved IP_ROUTER_ALERT out of rtnl_lock(),
      so we revert this patch.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Move IP_ROUTER_ALERT out of lock_sock(sk) · 0526947f
      Kirill Tkhai committed
      ip_ra_control() does not need sk_lock. Who are the other
      users of ip_ra_chain? ip_mroute_setsockopt() doesn't take
      sk_lock, while parallel IP_ROUTER_ALERT syscalls are
      synchronized by ip_ra_lock. So, we may move this command
      out of sk_lock.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Revert "ipv4: get rid of ip_ra_lock" · 76d3e153
      Kirill Tkhai committed
      This reverts commit ba3f571d. The commit was made
      after 1215e51e "ipv4: fix a deadlock in ip_ra_control",
      and killed ip_ra_lock, which became useless once rtnl_lock()
      was used to destroy every raw IPv4 socket. This scales
      very badly, and the next patch in the series reverts 1215e51e.
      ip_ra_lock will be used again.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: fix TUNNEL_SEQ bit check on sequence numbering · 15746394
      Colin Ian King committed
      The current logic of flags | TUNNEL_SEQ is always non-zero and hence
      sequence numbers are always incremented no matter the setting of the
      TUNNEL_SEQ bit.  Fix this by using & instead of |.
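
      A small standalone C sketch shows why the `|` form can never be false.
      The bit value `0x08` used for `TUNNEL_SEQ` is an assumption for
      illustration, and the two function names are hypothetical.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Assumed bit value, for illustration only. */
      #define TUNNEL_SEQ 0x08u

      /* Buggy test: OR-ing in a non-zero constant always gives non-zero. */
      bool seq_enabled_buggy(uint16_t flags)
      {
          return (flags | TUNNEL_SEQ) != 0;
      }

      /* Fixed test: AND masks out everything except the TUNNEL_SEQ bit. */
      bool seq_enabled_fixed(uint16_t flags)
      {
          return (flags & TUNNEL_SEQ) != 0;
      }
      ```

      With `flags == 0`, the buggy variant still reports the bit as set, while
      the fixed variant reports it as clear.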
      
      Detected by CoverityScan, CID#1466039 ("Operands don't affect result")
      
      Fixes: 77a5196a ("gre: add sequence number for collect md mode.")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: step sk->sk_drops when rcv buffer is full · 872619d8
      GhantaKrishnamurthy MohanKrishna committed
      Currently when tipc is unable to queue a received message on a
      socket, the message is rejected back to the sender with error
      TIPC_ERR_OVERLOAD. However, the application on this socket
      has no knowledge about these discards.
      
      In this commit, we try to step the sk_drops counter when tipc
      is unable to queue a received message. Export sk_drops
      using tipc socket diagnostics.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: implement socket diagnostics for AF_TIPC · c30b70de
      GhantaKrishnamurthy MohanKrishna committed
      This commit adds socket diagnostics capability for AF_TIPC in netlink
      family NETLINK_SOCK_DIAG in a new kernel module (diag.ko).
      
      The following are key design considerations:
      - config TIPC_DIAG has default y, like INET_DIAG.
      - only requests with flag NLM_F_DUMP are supported (dump all).
      - tipc_sock_diag_req message is introduced to send filter parameters.
      - the response attributes are TLVs, some nested.
      
      To avoid exposing data structures between diag and tipc modules and
      avoid code duplication, the following additions are required:
      - export tipc_nl_sk_walk function to reuse socket iterator.
      - export tipc_sk_fill_sock_diag to fill the tipc diag attributes.
      - create a sock_diag response message in __tipc_add_sock_diag defined
        in diag.c and use the above exported tipc_sk_fill_sock_diag
        to fill response.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: modify socket iterator for sock_diag · dfde331e
      GhantaKrishnamurthy MohanKrishna committed
      The current socket iterator function, tipc_nl_sk_dump, handles socket
      locks and calls __tipc_nl_add_sk for each socket.
      To reuse this logic in the sock_diag implementation, we make minor
      modifications to make these functions generic, as described below.
      
      In this commit, we add two new functions, __tipc_nl_sk_walk and
      __tipc_nl_add_sk_info, and modify tipc_nl_sk_dump and __tipc_nl_add_sk
      accordingly.
      
      In __tipc_nl_sk_walk we:
      1. acquire and release socket locks
      2. for each socket, execute the specified callback function
      
      In __tipc_nl_add_sk we:
      - Move the netlink attribute insertion to __tipc_nl_add_sk_info.
      
      tipc_nl_sk_dump calls tipc_nl_sk_walk with __tipc_nl_add_sk as argument.
      
      sock_diag will use these generic functions in a later commit.
      
      There is no functional change in this commit.
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Remove top_hierarchy arg to devlink_resource_register · 14530746
      David Ahern committed
      top_hierarchy arg can be determined by comparing parent_resource_id to
      DEVLINK_RESOURCE_ID_PARENT_TOP so it does not need to be a separate
      argument.
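
      The derivation amounts to a single comparison, sketched here in
      standalone C. The sentinel value `0` for
      `DEVLINK_RESOURCE_ID_PARENT_TOP` is an assumption for illustration, and
      `is_top_hierarchy` is a hypothetical helper name.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Assumed sentinel value, for illustration only. */
      #define DEVLINK_RESOURCE_ID_PARENT_TOP 0

      /* A resource is top-level exactly when its parent is the sentinel. */
      bool is_top_hierarchy(uint64_t parent_resource_id)
      {
          return parent_resource_id == DEVLINK_RESOURCE_ID_PARENT_TOP;
      }
      ```
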
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Handle onlink flag with multipath routes · 68e2ffde
      David Ahern committed
      For multipath routes the ONLINK flag can be specified per nexthop in
      rtnh_flags or globally in rtm_flags. Update ip6_route_multipath_add
      to consider the ONLINK setting coming from rtnh_flags. On each iteration
      over the nexthops, the config for the sibling route is initialized to the
      global config and the per-nexthop settings are then overlaid. The flag is
      'or'ed into fib6_config to handle the ONLINK flag coming from either
      rtm_flags or rtnh_flags.
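
      The overlay step can be sketched in standalone C as follows. This is a
      simplified model: `struct route_cfg` is a hypothetical stand-in for
      fib6_config with only a flags member, `sibling_cfg` is an illustrative
      name, and the value 4 for `RTNH_F_ONLINK` is stated here as an assumption.

      ```c
      #include <stdint.h>

      /* Assumed flag value, for illustration only. */
      #define RTNH_F_ONLINK 4u

      /* Hypothetical, reduced stand-in for the route config. */
      struct route_cfg {
          uint32_t flags;
      };

      /* Build the config for one sibling route of a multipath route. */
      struct route_cfg sibling_cfg(const struct route_cfg *global,
                                   uint32_t rtnh_flags)
      {
          struct route_cfg cfg = *global;           /* start from the global config */

          cfg.flags |= rtnh_flags & RTNH_F_ONLINK;  /* overlay per-nexthop ONLINK */
          return cfg;
      }
      ```

      Because the flag is OR-ed in, a nexthop ends up ONLINK if either the
      global flags or its own flags request it, and other per-nexthop bits do
      not leak into the sibling config.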
      
      Fixes: fc1e64e1 ("net/ipv6: Add support for onlink flag")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: fix NULL pointer dereference when setting encap source address · 8936ef76
      David Lebrun committed
      When using seg6 in encap mode, we call ipv6_dev_get_saddr() to set the
      source address of the outer IPv6 header, in case none was specified.
      Using skb->dev can lead to BUG() when it is in an inconsistent state.
      This patch uses the net_device attached to the skb's dst instead.
      
      [940807.667429] BUG: unable to handle kernel NULL pointer dereference at 000000000000047c
      [940807.762427] IP: ipv6_dev_get_saddr+0x8b/0x1d0
      [940807.815725] PGD 0 P4D 0
      [940807.847173] Oops: 0000 [#1] SMP PTI
      [940807.890073] Modules linked in:
      [940807.927765] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G        W        4.16.0-rc1-seg6bpf+ #2
      [940808.028988] Hardware name: HP ProLiant DL120 G6/ProLiant DL120 G6, BIOS O26    09/06/2010
      [940808.128128] RIP: 0010:ipv6_dev_get_saddr+0x8b/0x1d0
      [940808.187667] RSP: 0018:ffff88043fd836b0 EFLAGS: 00010206
      [940808.251366] RAX: 0000000000000005 RBX: ffff88042cb1c860 RCX: 00000000000000fe
      [940808.338025] RDX: 00000000000002c0 RSI: ffff88042cb1c860 RDI: 0000000000004500
      [940808.424683] RBP: ffff88043fd83740 R08: 0000000000000000 R09: ffffffffffffffff
      [940808.511342] R10: 0000000000000040 R11: 0000000000000000 R12: ffff88042cb1c850
      [940808.598012] R13: ffffffff8208e380 R14: ffff88042ac8da00 R15: 0000000000000002
      [940808.684675] FS:  0000000000000000(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000
      [940808.783036] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [940808.852975] CR2: 000000000000047c CR3: 00000004255fe000 CR4: 00000000000006e0
      [940808.939634] Call Trace:
      [940808.970041]  <IRQ>
      [940808.995250]  ? ip6t_do_table+0x265/0x640
      [940809.043341]  seg6_do_srh_encap+0x28f/0x300
      [940809.093516]  ? seg6_do_srh+0x1a0/0x210
      [940809.139528]  seg6_do_srh+0x1a0/0x210
      [940809.183462]  seg6_output+0x28/0x1e0
      [940809.226358]  lwtunnel_output+0x3f/0x70
      [940809.272370]  ip6_xmit+0x2b8/0x530
      [940809.313185]  ? ac6_proc_exit+0x20/0x20
      [940809.359197]  inet6_csk_xmit+0x7d/0xc0
      [940809.404173]  tcp_transmit_skb+0x548/0x9a0
      [940809.453304]  __tcp_retransmit_skb+0x1a8/0x7a0
      [940809.506603]  ? ip6_default_advmss+0x40/0x40
      [940809.557824]  ? tcp_current_mss+0x24/0x90
      [940809.605925]  tcp_retransmit_skb+0xd/0x80
      [940809.654016]  tcp_xmit_retransmit_queue.part.17+0xf9/0x210
      [940809.719797]  tcp_ack+0xa47/0x1110
      [940809.760612]  tcp_rcv_established+0x13c/0x570
      [940809.812865]  tcp_v6_do_rcv+0x151/0x3d0
      [940809.858879]  tcp_v6_rcv+0xa5c/0xb10
      [940809.901770]  ? seg6_output+0xdd/0x1e0
      [940809.946745]  ip6_input_finish+0xbb/0x460
      [940809.994837]  ip6_input+0x74/0x80
      [940810.034612]  ? ip6_rcv_finish+0xb0/0xb0
      [940810.081663]  ipv6_rcv+0x31c/0x4c0
      ...
      
      Fixes: 6c8702c6 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
      Reported-by: Tom Herbert <tom@quantonium.net>
      Signed-off-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state · 191f86ca
      David Lebrun committed
      The seg6_build_state() function is called with RCU read lock held,
      so we cannot use GFP_KERNEL. This patch uses GFP_ATOMIC instead.
      
      [   92.770271] =============================
      [   92.770628] WARNING: suspicious RCU usage
      [   92.770921] 4.16.0-rc4+ #12 Not tainted
      [   92.771277] -----------------------------
      [   92.771585] ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section!
      [   92.772279]
      [   92.772279] other info that might help us debug this:
      [   92.772279]
      [   92.773067]
      [   92.773067] rcu_scheduler_active = 2, debug_locks = 1
      [   92.773514] 2 locks held by ip/2413:
      [   92.773765]  #0:  (rtnl_mutex){+.+.}, at: [<00000000e5461720>] rtnetlink_rcv_msg+0x441/0x4d0
      [   92.774377]  #1:  (rcu_read_lock){....}, at: [<00000000df4f161e>] lwtunnel_build_state+0x59/0x210
      [   92.775065]
      [   92.775065] stack backtrace:
      [   92.775371] CPU: 0 PID: 2413 Comm: ip Not tainted 4.16.0-rc4+ #12
      [   92.775791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
      [   92.776608] Call Trace:
      [   92.776852]  dump_stack+0x7d/0xbc
      [   92.777130]  __schedule+0x133/0xf00
      [   92.777393]  ? unwind_get_return_address_ptr+0x50/0x50
      [   92.777783]  ? __sched_text_start+0x8/0x8
      [   92.778073]  ? rcu_is_watching+0x19/0x30
      [   92.778383]  ? kernel_text_address+0x49/0x60
      [   92.778800]  ? __kernel_text_address+0x9/0x30
      [   92.779241]  ? unwind_get_return_address+0x29/0x40
      [   92.779727]  ? pcpu_alloc+0x102/0x8f0
      [   92.780101]  _cond_resched+0x23/0x50
      [   92.780459]  __mutex_lock+0xbd/0xad0
      [   92.780818]  ? pcpu_alloc+0x102/0x8f0
      [   92.781194]  ? seg6_build_state+0x11d/0x240
      [   92.781611]  ? save_stack+0x9b/0xb0
      [   92.781965]  ? __ww_mutex_wakeup_for_backoff+0xf0/0xf0
      [   92.782480]  ? seg6_build_state+0x11d/0x240
      [   92.782925]  ? lwtunnel_build_state+0x1bd/0x210
      [   92.783393]  ? ip6_route_info_create+0x687/0x1640
      [   92.783846]  ? ip6_route_add+0x74/0x110
      [   92.784236]  ? inet6_rtm_newroute+0x8a/0xd0
      
      Fixes: 6c8702c6 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
      Signed-off-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 22 Mar, 2018 11 commits
    • rds: tcp: remove register_netdevice_notifier infrastructure. · bdf5bd7f
      Sowmini Varadhan committed
      The netns deletion path does not need to wait for all net_devices
      to be unregistered before dismantling rds_tcp state for the netns
      (we are able to dismantle this state on module unload even when
      all net_devices are active so there is no dependency here).
      
      This patch removes code related to netdevice notifiers and
      refactors all the code needed to dismantle rds_tcp state
      into a ->exit callback for the pernet_operations used with
      register_pernet_device().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert nf_ct_net_ops · aa65f636
      Kirill Tkhai committed
      These pernet_operations register and unregister sysctl.
      Also, there is inet_frags_exit_net() called in exit method,
      which has to be safe after a5600024 "net: Fix hlist
      corruptions in inet_evict_bucket()".
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert lowpan_frags_ops · 08012631
      Kirill Tkhai committed
      These pernet_operations register and unregister sysctl.
      Also, there is inet_frags_exit_net() called in exit method,
      which has to be safe after a5600024 "net: Fix hlist
      corruptions in inet_evict_bucket()".
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert can_pernet_ops · 1ae77627
      Kirill Tkhai committed
      These pernet_operations create and destroy /proc entries
      and cancel a per-net timer.
      
      Also, there are unneeded iterations over an empty list of net
      devices, since all net devices must already have been moved
      to init_net or unregistered by default_device_ops. This
      was already mentioned here:
      
      https://marc.info/?l=linux-can&m=150169589119335&w=2
      
      So, it looks safe to make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: fix idr leak in the error path of tcf_skbmod_init() · f29cdfbe
      Davide Caratti committed
      tcf_skbmod_init() can fail after the idr has been successfully reserved.
      When this happens, every subsequent attempt to configure skbmod rules
      using the same idr value will systematically fail with -ENOSPC, unless
      the first attempt was done using the 'replace' keyword:
      
       # tc action add action skbmod swap mac index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       # tc action add action skbmod swap mac index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       # tc action add action skbmod swap mac index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       ...
      
      Fix this in tcf_skbmod_init(), ensuring that tcf_idr_release() is called
      on the error path when the idr has been reserved, but not yet inserted.
      Also, don't test 'ovr' in the error path, to avoid a 'replace' failure
      implicitly becoming a 'delete' that leaks a refcount in the act_skbmod module:
      
       # rmmod act_skbmod; modprobe act_skbmod
       # tc action add action skbmod swap mac index 100
       # tc action add action skbmod swap mac continue index 100
       RTNETLINK answers: File exists
       We have an error talking to the kernel
       # tc action replace action skbmod swap mac continue index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       # tc action list action skbmod
       #
       # rmmod  act_skbmod
       rmmod: ERROR: Module act_skbmod is in use
      
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: fix idr leak in the error path of tcf_vlan_init() · d7f20015
      Davide Caratti committed
      tcf_vlan_init() can fail after the idr has been successfully reserved.
      When this happens, every subsequent attempt to configure vlan rules using
      the same idr value will systematically fail with -ENOSPC, unless the first
      attempt was done using the 'replace' keyword.
      
       # tc action add action vlan pop index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       # tc action add action vlan pop index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       # tc action add action vlan pop index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       ...
      
      Fix this in tcf_vlan_init(), ensuring that tcf_idr_release() is called on
      the error path when the idr has been reserved, but not yet inserted. Also,
      don't test 'ovr' in the error path, to avoid a 'replace' failure implicitly
      becoming a 'delete' that leaks a refcount in the act_vlan module:
      
       # rmmod act_vlan; modprobe act_vlan
       # tc action add action vlan push id 5 index 100
       # tc action replace action vlan push id 7 index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       # tc action list action vlan
       #
       # rmmod act_vlan
       rmmod: ERROR: Module act_vlan is in use
      
      Fixes: 4c5b9d96 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: fix idr leak in the error path of __tcf_ipt_init() · 1e46ef17
      Davide Caratti committed
      __tcf_ipt_init() can fail after the idr has been successfully reserved.
      When this happens, subsequent attempts to configure xt/ipt rules using
      the same idr value systematically fail with -ENOSPC:
      
       # tc action add action xt -j LOG --log-prefix test1 index 100
       tablename: mangle hook: NF_IP_POST_ROUTING
               target:  LOG level warning prefix "test1" index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       Command "(null)" is unknown, try "tc actions help".
       # tc action add action xt -j LOG --log-prefix test1 index 100
       tablename: mangle hook: NF_IP_POST_ROUTING
               target:  LOG level warning prefix "test1" index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       Command "(null)" is unknown, try "tc actions help".
       # tc action add action xt -j LOG --log-prefix test1 index 100
       tablename: mangle hook: NF_IP_POST_ROUTING
               target:  LOG level warning prefix "test1" index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       ...
      
      Fix this in the error path of __tcf_ipt_init(), calling tcf_idr_release()
      in place of tcf_idr_cleanup(). Since tcf_ipt_release() can now be called
      when tcfi_t is NULL, we also need to protect calls to ipt_destroy_target()
      to avoid NULL pointer dereference.
      
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: fix idr leak in the error path of tcp_pedit_init() · 94fa3f92
      Davide Caratti committed
      tcf_pedit_init() can fail to allocate 'keys' after the idr has been
      successfully reserved. When this happens, subsequent attempts to configure
      a pedit rule using the same idr value systematically fail with -ENOSPC:
      
       # tc action add action pedit munge ip ttl set 63 index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       # tc action add action pedit munge ip ttl set 63 index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       # tc action add action pedit munge ip ttl set 63 index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       ...
      
      Fix this in the error path of tcf_act_pedit_init(), calling
      tcf_idr_release() in place of tcf_idr_cleanup().
      
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: fix idr leak in the error path of tcf_act_police_init() · 5bf7f818
      Davide Caratti committed
      tcf_act_police_init() can fail after the idr has been successfully
      reserved (e.g., qdisc_get_rtab() may return NULL). When this happens,
      subsequent attempts to configure a police rule using the same idr value
      systematically fail with -ENOSPC:
      
       # tc action add action police rate 1000 burst 1000 drop index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       # tc action add action police rate 1000 burst 1000 drop index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       # tc action add action police rate 1000 burst 1000 drop index 100
       RTNETLINK answers: No space left on device
       ...
      
      Fix this in the error path of tcf_act_police_init(), calling
      tcf_idr_release() in place of tcf_idr_cleanup().
      
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: fix idr leak in the error path of tcf_simp_init() · 60e10b3a
      Davide Caratti committed
      If the kernel fails to duplicate 'sdata', creation of a new action fails
      with -ENOMEM. However, subsequent attempts to install the same action
      using the same value of 'index' systematically fail with -ENOSPC, and
      that value of 'index' is no longer usable by act_simple until
      act_simple.ko is removed and inserted again (rmmod / insmod):
      
       # tc actions add action simple sdata hello index 100
       # tc actions list action simple
      
              action order 0: Simple <hello>
               index 100 ref 1 bind 0
       # tc actions flush action simple
       # tc actions add action simple sdata hello index 100
       RTNETLINK answers: Cannot allocate memory
       We have an error talking to the kernel
       # tc actions flush action simple
       # tc actions add action simple sdata hello index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       # tc actions add action simple sdata hello index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
       ...
      
      Fix this in the error path of tcf_simp_init(), calling tcf_idr_release()
      in place of tcf_idr_cleanup().
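
      The failure pattern shared by these idr fixes can be modeled in standalone
      C. This is a deliberately simplified sketch, not the act_api code: a
      boolean table stands in for the idr, and `reserve_index`, `release_index`
      and `action_init` are hypothetical names. The `release_on_error` switch
      selects between the leaking error path and the fixed one.

      ```c
      #include <errno.h>
      #include <stdbool.h>

      #define MAX_INDEX 256

      /* Toy replacement for the idr: true means the index is taken. */
      static bool reserved[MAX_INDEX];

      static int reserve_index(unsigned int idx)
      {
          if (idx >= MAX_INDEX || reserved[idx])
              return -ENOSPC;
          reserved[idx] = true;
          return 0;
      }

      static void release_index(unsigned int idx)
      {
          if (idx < MAX_INDEX)
              reserved[idx] = false;
      }

      /*
       * Model of an action init function: reserve the index first, then do
       * work that may fail. `alloc_fails` simulates a failed allocation.
       */
      int action_init(unsigned int idx, bool alloc_fails, bool release_on_error)
      {
          int err = reserve_index(idx);

          if (err)
              return err;
          if (alloc_fails) {
              if (release_on_error)
                  release_index(idx);  /* fixed path: give the index back */
              return -ENOMEM;          /* leaking path keeps it reserved */
          }
          return 0;
      }
      ```

      In the leaking variant the first call fails with -ENOMEM and every later
      call for the same index fails with -ENOSPC, matching the transcript above.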
      
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60e10b3a
    • D
      net/sched: fix idr leak on the error path of tcf_bpf_init() · bbc09e78
      Davide Caratti committed on
      When the following command sequence is entered
      
       # tc action add action bpf bytecode '4,40 0 0 12,31 0 1 2048,6 0 0 262144,6 0 0 0' index 100
       RTNETLINK answers: Invalid argument
       We have an error talking to the kernel
       # tc action add action bpf bytecode '4,40 0 0 12,21 0 1 2048,6 0 0 262144,6 0 0 0' index 100
       RTNETLINK answers: No space left on device
       We have an error talking to the kernel
      
      act_bpf correctly refuses to install the first TC rule, because 31 is not
      a valid instruction. However, it also refuses to install the second TC
      rule, even though its BPF code is correct. Furthermore, it is no longer
      possible to install any other rule with the same value of 'index' until
      the act_bpf module is unloaded and inserted again. After the idr has been
      reserved, call tcf_idr_release() instead of tcf_idr_cleanup() to fix this
      issue.
      
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bbc09e78
  7. 21 Mar, 2018 2 commits
  8. 20 Mar, 2018 6 commits
    • A
      devlink: Remove redundant free on error path · 7fe4d6dc
      Arkadi Sharshevsky committed on
      The current code performs an unneeded free. Remove the redundant skb
      freeing on the error path.
      
      Fixes: 1555d204 ("devlink: Support for pipeline debug (dpipe)")
      Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7fe4d6dc
    • J
      bpf: sk_msg program helper bpf_sk_msg_pull_data · 015632bb
      John Fastabend committed on
      Currently, if a bpf sk msg program is run, the program
      can only parse data that the (start,end) pointers already
      cover. For sendmsg hooks this is likely the first
      scatterlist element. For sendpage this will be the range
      (0,0) because the data is shared with userspace and by
      default we want to avoid allowing userspace to modify
      data while (or after) BPF verdict is being decided.
      
      To support pulling in additional bytes for parsing, use
      a new helper bpf_sk_msg_pull(start, end, flags), which
      works similarly to the cls tc logic. This helper will attempt
      to point the data start pointer at 'start' bytes offset
      into the msg and the data end pointer at 'end' bytes offset
      into the message.
      
      After basic sanity checks to ensure 'start' <= 'end' and
      'end' <= msg_length there are a few cases we need to
      handle.
      
      First, the sendmsg hook has already copied the data from
      userspace and has exclusive access to it. Therefore, it
      is not always necessary to copy the data. However, it may
      be required. After finding the scatterlist element with
      'start' offset byte in it there are two cases. In the first,
      the range (start,end) is entirely contained in the sg element
      and is already linear. All that is needed is to update the
      data pointers; no allocate/copy is needed. In the other case,
      (start, end) crosses sg element boundaries. In this
      case we allocate a block of size 'end - start' and copy
      the data to linearize it.
      
      Next, the sendpage hook has not copied any data in its initial
      state, so the data pointers are (0,0). In this case we
      handle it similarly to the above sendmsg case, except that the
      allocation/copy must always happen. Then when sending
      the data we have possibly three memory regions that
      need to be sent, (0, start - 1), (start, end), and
      (end + 1, msg_length). This is required to ensure any
      writes by the BPF program are correctly transmitted.
      
      Lastly this operation will invalidate any previous
      data checks so BPF programs will have to revalidate
      pointers after making this BPF call.
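
      The two sendmsg cases described above can be sketched in user space.
      This is a simplified model under stated assumptions, not the kernel
      helper: struct toy_sg and toy_pull_data() are hypothetical names, the
      range is treated as half-open [start, end) so that its size is
      'end - start', and the caller frees the block when a copy was made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one scatterlist element. */
struct toy_sg {
    const unsigned char *buf;
    size_t len;
};

/*
 * Return a pointer to the linear bytes [start, end) of the message.
 * *copied is set to 1 if a new block was allocated (caller must free it),
 * 0 if the pointer refers into an existing element.
 * Returns NULL on an invalid range or allocation failure.
 */
static const unsigned char *toy_pull_data(const struct toy_sg *sg, size_t nsg,
                                          size_t start, size_t end, int *copied)
{
    size_t total = 0, off = 0, i, first;

    *copied = 0;
    for (i = 0; i < nsg; i++)
        total += sg[i].len;

    /* Basic sanity checks: start <= end and end <= msg_length. */
    if (start > end || end > total)
        return NULL;

    /* Find the element that holds the byte at offset 'start'. */
    for (first = 0; first < nsg; first++) {
        if (start < off + sg[first].len)
            break;
        off += sg[first].len;
    }
    if (first == nsg)
        return NULL;

    /* Case 1: the range is inside one element, so it is already linear. */
    if (end <= off + sg[first].len)
        return sg[first].buf + (start - off);

    /* Case 2: the range crosses elements; allocate and copy to linearize. */
    unsigned char *lin = malloc(end - start);
    if (!lin)
        return NULL;

    size_t pos = start, out = 0;
    for (i = first; i < nsg && pos < end; i++) {
        size_t in_elem = pos - off;          /* offset inside element i */
        size_t n = sg[i].len - in_elem;      /* bytes left in element i */

        if (n > end - pos)
            n = end - pos;
        memcpy(lin + out, sg[i].buf + in_elem, n);
        out += n;
        pos += n;
        off += sg[i].len;
    }
    *copied = 1;
    return lin;
}
```

      For a message made of the elements "hello" and "world", the range
      [1, 4) is served without a copy, while [3, 7) spans both elements and
      is copied into a fresh 4-byte block.
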
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      015632bb
    • J
      bpf: sockmap, add msg_cork_bytes() helper · 91843d54
      John Fastabend committed on
      In the case where we need a specific number of bytes before a
      verdict can be assigned, even if the data spans multiple sendmsg
      or sendfile calls, the BPF program may use msg_cork_bytes().
      
      The extreme case is a user can call sendmsg repeatedly with
      1-byte msg segments. Obviously, this is bad for performance but
      is still valid. If the BPF program needs N bytes to validate
      a header it can use msg_cork_bytes to specify N bytes and the
      BPF program will not be called again until N bytes have been
      accumulated. The infrastructure will attempt to coalesce data
      if possible, so in many cases (most of my use cases at least) the
      data will be in a single scatterlist element with data pointers
      pointing to start/end of the element. However, this is dependent
      on available memory, so it is not guaranteed. BPF programs must
      therefore validate data pointer ranges, but this is the case anyway
      to convince the verifier the accesses are valid.
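
      The corking behaviour can be illustrated with a small user-space
      model. This is a sketch of the accounting only, with hypothetical
      names (struct toy_cork, toy_cork_bytes(), toy_cork_add()); it does not
      model buffering, coalescing or memory pressure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-socket cork state. */
struct toy_cork {
    size_t need;  /* bytes requested before the next program run */
    size_t have;  /* bytes accumulated so far */
};

/* Ask for 'bytes' to be accumulated before the program runs again. */
static void toy_cork_bytes(struct toy_cork *c, size_t bytes)
{
    c->need = bytes;
    c->have = 0;
}

/*
 * Account for one send of 'len' bytes.
 * Returns 1 if the program should run now, 0 if data is still corked.
 */
static int toy_cork_add(struct toy_cork *c, size_t len)
{
    c->have += len;
    if (c->have < c->need)
        return 0;
    /* Enough data: clear the cork so the next send runs the program. */
    c->need = 0;
    c->have = 0;
    return 1;
}
```

      With a cork of 4 bytes, four 1-byte sends trigger a single program
      run on the fourth send, which matches the extreme case described
      above.
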
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      91843d54
    • J
      bpf: sockmap, add bpf_msg_apply_bytes() helper · 2a100317
      John Fastabend committed on
      A single sendmsg or sendfile system call can contain multiple logical
      messages that a BPF program may want to read and apply a verdict. But,
      without an apply_bytes helper any verdict on the data applies to all
      bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
      care to read the first N bytes of a msg. If the payload is large, say
      MB or even GB, setting up and calling the BPF program repeatedly for
      all bytes, even though the verdict is already known, creates
      unnecessary overhead.
      
      To allow BPF programs to control how many bytes a given verdict
      applies to we implement a bpf_msg_apply_bytes() helper. When called
      from within a BPF program this sets a counter, internal to the
      BPF infrastructure, that applies the last verdict to the next N
      bytes. If the N is smaller than the current data being processed
      from a sendmsg/sendfile call, the first N bytes will be sent and
      the BPF program will be re-run with start_data pointing to the N+1
      byte. If N is larger than the current data being processed the
      BPF verdict will be applied to multiple sendmsg/sendfile calls
      until N bytes are consumed.
      
      Note: if a socket closes with the apply_bytes counter non-zero, this
      is not a problem, because data is not being buffered for N bytes
      and is sent as it is received.
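
      The counter semantics can be sketched as follows. This is a
      user-space model with hypothetical names (struct toy_apply,
      toy_apply_consume()), covering only the byte accounting and not the
      actual send path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-socket state: bytes still covered by the last verdict. */
struct toy_apply {
    size_t apply_bytes;
};

/*
 * Account for 'len' bytes of a send. Returns how many of those bytes are
 * covered by the previous verdict. If the return value is smaller than
 * 'len', the program has to be re-run for the remaining bytes.
 */
static size_t toy_apply_consume(struct toy_apply *a, size_t len)
{
    size_t n = a->apply_bytes < len ? a->apply_bytes : len;

    a->apply_bytes -= n;
    return n;
}
```

      With N = 10, a 4-byte send is fully covered and leaves 6 bytes of
      credit. A following 8-byte send is covered for its first 6 bytes, and
      the program is re-run for the last 2.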
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2a100317
    • J
      bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      John Fastabend committed on
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages when a sock is added to a
      sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
      program type attached then the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in sendmsg case and page/offset in sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and in the sendpage case leaves the data untouched. Both cases
      return -EACCES to the user. Returning SK_PASS will allow the msg to
      be sent.
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this the initial
      start/end data pointers are (0,0). Meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. It is used similarly to the existing redirect use
      cases and allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        // (key and value are both passed by pointer)
        bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f738adb
    • J
      net: generalize sk_alloc_sg to work with scatterlist rings · 8c05dbf0
      John Fastabend committed on
      The current implementation of sk_alloc_sg expects scatterlist to always
      start at entry 0 and complete at entry MAX_SKB_FRAGS.
      
      Future patches will want to support starting at an arbitrary offset into
      the scatterlist, so add an additional sg_start parameter and default
      to the current values in TLS code paths.
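
      The idea of walking a scatterlist as a ring from an arbitrary start
      can be sketched as below. This is an illustration only: toy_ring_walk()
      and TOY_RING_SIZE are hypothetical, the ring size is an assumed
      stand-in for MAX_SKB_FRAGS, and the real sk_alloc_sg() signature is not
      reproduced here.

```c
#include <assert.h>

/* Assumed ring size for the sketch; stands in for MAX_SKB_FRAGS. */
#define TOY_RING_SIZE 17

/*
 * Visit 'count' ring entries starting at 'sg_start', wrapping back to
 * entry 0 after the last entry. The visited indices are written to 'out',
 * which must have room for 'count' values.
 * Returns 0 on success, -1 if the arguments do not fit the ring.
 */
static int toy_ring_walk(unsigned int sg_start, unsigned int count,
                         unsigned int *out)
{
    if (sg_start >= TOY_RING_SIZE || count > TOY_RING_SIZE)
        return -1;

    unsigned int i = sg_start;
    for (unsigned int n = 0; n < count; n++) {
        out[n] = i;
        if (++i == TOY_RING_SIZE)
            i = 0;  /* wrap around to the start of the ring */
    }
    return 0;
}
```

      Passing sg_start = 0 gives the old behaviour of starting at entry 0,
      which is what the TLS code paths would keep using under this model.
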
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      8c05dbf0