1. 11 May, 2018 1 commit
  2. 08 May, 2018 2 commits
  3. 03 May, 2018 1 commit
  4. 02 May, 2018 2 commits
    • ipv6: Allow non-gateway ECMP for IPv6 · edd7ceb7
      Committed by Thomas Winter
      It is valid to have static routes where the nexthop
      is an interface not an address such as tunnels.
      For IPv4 it was possible to use ECMP on these routes
      but not for IPv6.
      Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: fix uninit-value in ip6_multipath_l3_keys() · cea67a2d
      Committed by Eric Dumazet
      syzbot/KMSAN reported an uninit-value in ip6_multipath_l3_keys(),
      root caused to a bad assumption of ICMP header being already
      pulled in skb->head.
      
      ip_multipath_l3_keys() does the correct thing, so it is an IPv6 only bug.
      
      BUG: KMSAN: uninit-value in ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
      BUG: KMSAN: uninit-value in rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
      CPU: 0 PID: 4507 Comm: syz-executor661 Not tainted 4.16.0+ #87
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
       ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
       rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
       ip6_route_input+0x65a/0x920 net/ipv6/route.c:1884
       ip6_rcv_finish+0x413/0x6e0 net/ipv6/ip6_input.c:69
       NF_HOOK include/linux/netfilter.h:288 [inline]
       ipv6_rcv+0x1e16/0x2340 net/ipv6/ip6_input.c:208
       __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562
       __netif_receive_skb net/core/dev.c:4627 [inline]
       netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
       netif_receive_skb+0x230/0x240 net/core/dev.c:4725
       tun_rx_batched drivers/net/tun.c:1555 [inline]
       tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962
       tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
       call_write_iter include/linux/fs.h:1782 [inline]
       new_sync_write fs/read_write.c:469 [inline]
       __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
       vfs_write+0x463/0x8d0 fs/read_write.c:544
       SYSC_write+0x172/0x360 fs/read_write.c:589
       SyS_write+0x55/0x80 fs/read_write.c:581
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 23aebdac ("ipv6: Compute multipath hash for ICMP errors from offending packet")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jakub Sitnicki <jkbs@redhat.com>
      Acked-by: Jakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
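The class of bug described in this commit, reading a header field before confirming that its bytes are in the linear buffer, can be illustrated in userspace. The following is a minimal sketch, not the actual patch: `struct pkt`, `header_pointer()` and `read_type()` are hypothetical names, loosely modeled on the idea of checking the range first and copying into a scratch buffer when the data is not contiguous.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet: the first head_len bytes are linear, the rest
 * live in a separate fragment (a stand-in for non-linear packet data). */
struct pkt {
	const uint8_t *head;
	size_t head_len;
	const uint8_t *frag;
	size_t frag_len;
};

/* Return a pointer to `size` bytes at offset `off`, copying them into
 * `scratch` when the range is not fully inside the linear area.
 * Returns NULL if the packet is too short. */
static const void *header_pointer(const struct pkt *p, size_t off,
				  size_t size, void *scratch)
{
	size_t total = p->head_len + p->frag_len;
	size_t i;

	if (off > total || size > total - off)
		return NULL;
	if (off + size <= p->head_len)
		return p->head + off;

	for (i = 0; i < size; i++) {
		size_t pos = off + i;

		((uint8_t *)scratch)[i] = pos < p->head_len ?
			p->head[pos] : p->frag[pos - p->head_len];
	}
	return scratch;
}

/* Read a one-byte "type" field at `off` without assuming it is linear.
 * Returns -1 when the field is not present in the packet. */
static int read_type(const struct pkt *p, size_t off)
{
	uint8_t tmp;
	const uint8_t *t = header_pointer(p, off, sizeof(tmp), &tmp);

	return t ? *t : -1;
}
```

The caller never dereferences `head + off` directly, so a field that lies past the linear area is either copied safely or reported as missing.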
  5. 27 April, 2018 1 commit
  6. 24 April, 2018 1 commit
  7. 23 April, 2018 1 commit
    • ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts · a957fa19
      Committed by Ahmed Abdelsalam
      In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
      in order to set the src addr of outer IPv6 header.
      
      The net_device is required for set_tun_src(). However, calling ip6_dst_idev()
      on dst_entry in case of IPv4 traffic results in the following bug.
      
      Using just dst->dev should fix this BUG.
      
      [  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
      [  196.243329] Oops: 0000 [#1] SMP PTI
      [  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
      [  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
      [  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
      [  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
      [  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
      [  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
      [  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
      [  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
      [  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
      [  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
      [  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
      [  196.247804] Call Trace:
      [  196.247972]  seg6_do_srh+0x15b/0x1c0
      [  196.248156]  seg6_output+0x3c/0x220
      [  196.248341]  ? prandom_u32+0x14/0x20
      [  196.248526]  ? ip_idents_reserve+0x6c/0x80
      [  196.248723]  ? __ip_select_ident+0x90/0x100
      [  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
      [  196.249133]  lwtunnel_output+0x44/0x70
      [  196.249328]  ip_send_skb+0x15/0x40
      [  196.249515]  raw_sendmsg+0x8c3/0xac0
      [  196.249701]  ? _copy_from_user+0x2e/0x60
      [  196.249897]  ? rw_copy_check_uvector+0x53/0x110
      [  196.250106]  ? _copy_from_user+0x2e/0x60
      [  196.250299]  ? copy_msghdr_from_user+0xce/0x140
      [  196.250508]  sock_sendmsg+0x36/0x40
      [  196.250690]  ___sys_sendmsg+0x292/0x2a0
      [  196.250881]  ? _cond_resched+0x15/0x30
      [  196.251074]  ? copy_termios+0x1e/0x70
      [  196.251261]  ? _copy_to_user+0x22/0x30
      [  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
      [  196.251782]  ? _cond_resched+0x15/0x30
      [  196.251972]  ? mutex_lock+0xe/0x30
      [  196.252152]  ? vvar_fault+0xd2/0x110
      [  196.252337]  ? __do_fault+0x1f/0xc0
      [  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
      [  196.252727]  ? __sys_sendmsg+0x63/0xa0
      [  196.252919]  __sys_sendmsg+0x63/0xa0
      [  196.253107]  do_syscall_64+0x72/0x200
      [  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [  196.253530] RIP: 0033:0x7fc4480b0690
      [  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
      [  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
      [  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
      [  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
      [  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
      [  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
      [  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
      [  196.256445] CR2: 0000000000000000
      [  196.256676] ---[ end trace 71af7d093603885c ]---
      
      Fixes: 8936ef76 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
      Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 19 April, 2018 1 commit
    • netfilter: nf_tables: NAT chain and extensions require NF_TABLES · 39f2ff08
      Committed by Pablo Neira Ayuso
      Move these options inside the scope of the 'if' NF_TABLES and
      NF_TABLES_IPV6 dependencies. This patch fixes:
      
         net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain':
      >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain'
         net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit':
      >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type'
         net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init':
      >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type'
      
      that happens with:
      
      CONFIG_NF_TABLES=m
      CONFIG_NFT_CHAIN_NAT_IPV6=y
      
      Fixes: 02c7b25e ("netfilter: nf_tables: build-in filter chain type")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  9. 16 April, 2018 1 commit
  10. 06 April, 2018 5 commits
    • net/ipv6: Increment OUTxxx counters after netfilter hook · 71a1c915
      Committed by Jeff Barnhill
      At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and
      IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call
      for NFPROTO_IPV6 / NF_INET_FORWARD.  As a result, these counters get
      incremented regardless of whether or not the netfilter hook allows the
      packet to continue being processed.  This change increments the counters
      in ip6_forward_finish() so that it will not happen if the netfilter hook
      chooses to terminate the packet, which is similar to how IPv4 works.
      Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
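The ordering change described in this commit can be sketched in userspace as follows. This is only an illustration under assumptions: `struct fwd_stats`, `forward()`, `forward_finish()` and `accept_small()` are hypothetical stand-ins, not the kernel functions, and the hook is reduced to a callback that returns accept or drop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two forwarding counters. */
struct fwd_stats {
	unsigned long out_forw_datagrams;
	unsigned long out_octets;
};

typedef bool (*hook_fn)(size_t len);	/* true = accept, false = drop */

/* Runs only for packets the hook accepted, so counting here never
 * includes packets the hook terminated. */
static void forward_finish(struct fwd_stats *st, size_t len)
{
	st->out_forw_datagrams++;
	st->out_octets += len;
}

/* Returns true if the packet was forwarded. */
static bool forward(struct fwd_stats *st, hook_fn hook, size_t len)
{
	if (!hook(len))
		return false;	/* dropped: counters stay untouched */
	forward_finish(st, len);
	return true;
}

/* Example hook (hypothetical policy): accept packets up to 1500 bytes. */
static bool accept_small(size_t len)
{
	return len <= 1500;
}
```

Counting in the finish step rather than before the hook is what keeps the statistics consistent with what actually left the box.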
    • vti6: better validate user provided tunnel names · 537b361f
      Committed by Eric Dumazet
      Use valid_name() to make sure user does not provide illegal
      device name.
      
      Fixes: ed1efb2a ("ipv6: Add support for IPsec virtual tunnel interfaces")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_tunnel: better validate user provided tunnel names · db7a65e3
      Committed by Eric Dumazet
      Use valid_name() to make sure user does not provide illegal
      device name.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_gre: better validate user provided tunnel names · 5f42df01
      Committed by Eric Dumazet
      Use dev_valid_name() to make sure user does not provide illegal
      device name.
      
      syzbot caught the following bug:
      
      BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline]
      BUG: KASAN: stack-out-of-bounds in ip6gre_tunnel_locate+0x334/0x860 net/ipv6/ip6_gre.c:339
      Write of size 20 at addr ffff8801afb9f7b8 by task syzkaller851048/4466
      
      CPU: 1 PID: 4466 Comm: syzkaller851048 Not tainted 4.16.0+ #1
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x1b9/0x29f lib/dump_stack.c:53
       print_address_description+0x6c/0x20b mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       memcpy+0x37/0x50 mm/kasan/kasan.c:303
       strlcpy include/linux/string.h:300 [inline]
       ip6gre_tunnel_locate+0x334/0x860 net/ipv6/ip6_gre.c:339
       ip6gre_tunnel_ioctl+0x69d/0x12e0 net/ipv6/ip6_gre.c:1195
       dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334
       dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525
       sock_ioctl+0x47e/0x680 net/socket.c:1015
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:500 [inline]
       do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
       ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
       SYSC_ioctl fs/ioctl.c:708 [inline]
       SyS_ioctl+0x24/0x30 fs/ioctl.c:706
       do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
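The tunnel-name commits in this group all follow the same pattern: validate the user-supplied name before copying it into a fixed-size buffer. A minimal userspace sketch of that pattern is shown below. `name_is_valid()` and `set_tunnel_name()` are hypothetical names, and the rules are only an approximation of what a device-name check typically enforces (non-empty, shorter than the buffer, not `.` or `..`, no `/`, `:` or whitespace).

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define TUNNEL_NAME_SIZE 16	/* same value as the kernel's IFNAMSIZ */

/* Approximate device-name rules; an assumption for illustration. */
static bool name_is_valid(const char *name)
{
	if (*name == '\0')
		return false;
	if (strlen(name) >= TUNNEL_NAME_SIZE)
		return false;	/* would not fit with its terminator */
	if (!strcmp(name, ".") || !strcmp(name, ".."))
		return false;
	for (; *name; name++) {
		if (*name == '/' || *name == ':' ||
		    isspace((unsigned char)*name))
			return false;
	}
	return true;
}

/* Copy a user-supplied name only after validating it. */
static bool set_tunnel_name(char dst[TUNNEL_NAME_SIZE], const char *user_name)
{
	if (!name_is_valid(user_name))
		return false;
	strcpy(dst, user_name);	/* safe: length was checked above */
	return true;
}
```

Rejecting the name up front means the copy can never write past the destination, which is the overflow the KASAN report in this commit shows.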
    • ipv6: sit: better validate user provided tunnel names · b95211e0
      Committed by Eric Dumazet
      Use dev_valid_name() to make sure user does not provide illegal
      device name.
      
      syzbot caught the following bug:
      
      BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline]
      BUG: KASAN: stack-out-of-bounds in ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254
      Write of size 33 at addr ffff8801b64076d8 by task syzkaller932654/4453
      
      CPU: 0 PID: 4453 Comm: syzkaller932654 Not tainted 4.16.0+ #1
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x1b9/0x29f lib/dump_stack.c:53
       print_address_description+0x6c/0x20b mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       memcpy+0x37/0x50 mm/kasan/kasan.c:303
       strlcpy include/linux/string.h:300 [inline]
       ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254
       ipip6_tunnel_ioctl+0xe71/0x241b net/ipv6/sit.c:1221
       dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334
       dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525
       sock_ioctl+0x47e/0x680 net/socket.c:1015
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:500 [inline]
       do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
       ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
       SYSC_ioctl fs/ioctl.c:708 [inline]
       SyS_ioctl+0x24/0x30 fs/ioctl.c:706
       do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 05 April, 2018 1 commit
  12. 04 April, 2018 5 commits
  13. 02 April, 2018 3 commits
  14. 01 April, 2018 11 commits
  15. 31 March, 2018 4 commits
    • bpf: Post-hooks for sys_bind · aac3fc32
      Committed by Andrey Ignatov
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair to some map to
      use it in later calls to sys_connect. E.g. if some TCP server inside
      cgroup was bound to some IP:port_n, it can be recorded to a map. And
      later when some TCP client inside same cgroup is trying to connect to
      127.0.0.1:port_n, BPF hook for sys_connect can override the destination
      and connect application to IP:port_n instead of 127.0.0.1:port_n. That
      helps forcing all applications inside a cgroup to use desired IP and not
      break those applications if they e.g. use localhost to communicate
      between each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid access
      to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
      `inet6_bind()` since those fields might not make sense in such cases.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
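The first use case in this commit message (inspect the allocated port and fail the bind with EPERM if it violates a policy) can be modeled in plain C. This is a userspace sketch, not a BPF program: `struct bound_sock`, `post_bind_policy()` and `finish_bind()` are hypothetical, and the 8000-8999 port range is an arbitrary example policy.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical view of a socket after bind has assigned its port. */
struct bound_sock {
	uint16_t local_port;	/* host byte order */
};

/* Hypothetical policy: 1 = allow, 0 = deny. */
static int post_bind_policy(const struct bound_sock *sk)
{
	return sk->local_port >= 8000 && sk->local_port <= 8999;
}

/* Model of the tail of bind: the port is already allocated, and the
 * post-hook can only turn a successful result into -EPERM. */
static int finish_bind(const struct bound_sock *sk)
{
	if (!post_bind_policy(sk))
		return -EPERM;
	return 0;
}
```

The point of running the check this late is that the policy sees the port that was actually chosen, including one picked automatically.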
    • bpf: Hooks for sys_connect · d74bad4e
      Committed by Andrey Ignatov
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connections from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows binding to an IP only,
      though, and doesn't support binding to a port, i.e. it leverages the
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP
      regardless of the value of the `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
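The destination override described in this commit can be illustrated with ordinary socket address types. The sketch below is a userspace model under assumptions, not a BPF program: `struct cgroup_cfg` and `rewrite_connect_dst()` are hypothetical, and only the IPv4 loopback case is handled.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical per-cgroup setting: the IP that loopback connections
 * should be redirected to, in network byte order. */
struct cgroup_cfg {
	struct in_addr ip;
};

/* Rewrite 127.0.0.1:port to cfg->ip:port, leaving the port and any
 * non-loopback destination untouched. Returns 1 if rewritten. */
static int rewrite_connect_dst(const struct cgroup_cfg *cfg,
			       struct sockaddr_in *dst)
{
	if (dst->sin_family != AF_INET)
		return 0;
	if (dst->sin_addr.s_addr != htonl(INADDR_LOOPBACK))
		return 0;
	dst->sin_addr = cfg->ip;
	return 1;
}
```

Keeping the IPv4 logic separate from any IPv6 variant mirrors the reason the commit gives for separate attach types: a `sockaddr_in` has no IPv6 fields to read or write.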
    • net: Introduce __inet_bind() and __inet6_bind · 3679d585
      Committed by Andrey Ignatov
      Refactor `bind()` code to make it ready to be called from BPF helper
      function `bpf_bind()` (will be added soon). Implementation of
      `inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
      `__inet6_bind()` respectively. These functions can be used from both
      `sk_prot->bind` and `bpf_bind()` contexts.
      
      New functions have two additional arguments.
      
      `force_bind_address_no_port` forces binding to IP only w/o checking
      `inet_sock.bind_address_no_port` field. It'll allow to bind local end of
      a connection to desired IP in `bpf_bind()` w/o changing
      `bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
      can return an error and we'd need to restore original value of
      `bind_address_no_port` in that case if we changed this before calling to
      the helper.
      
      `with_lock` specifies whether to lock socket when working with `struct
      sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
      behavior is preserved. But it will be set to `false` for `bpf_bind()`
      use-case. The reason is all call-sites, where `bpf_bind()` will be
      called, already hold that socket lock.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
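The refactoring described in this commit, one shared implementation with `force_bind_address_no_port` and `with_lock` arguments, can be sketched in userspace. Everything here is a simplified model: `struct fake_sock`, `bind_common()` and the two wrappers are hypothetical, the lock is a depth counter, and the autobind port is a placeholder constant.

```c
#include <stdbool.h>

/* Hypothetical socket with a lock-depth counter standing in for the
 * real socket lock. */
struct fake_sock {
	int lock_depth;
	bool bind_address_no_port;
	unsigned int bound_ip;
	unsigned short bound_port;
};

static void lock_fake_sock(struct fake_sock *sk) { sk->lock_depth++; }
static void release_fake_sock(struct fake_sock *sk) { sk->lock_depth--; }

/* Shared implementation: the caller decides whether to force IP-only
 * binding and whether the lock still needs to be taken. */
static int bind_common(struct fake_sock *sk, unsigned int ip,
		       unsigned short port,
		       bool force_bind_address_no_port, bool with_lock)
{
	if (with_lock)
		lock_fake_sock(sk);

	sk->bound_ip = ip;
	/* Assign a port if one was requested, or if IP-only binding is
	 * neither set on the socket nor forced by the caller. */
	if (port || !(sk->bind_address_no_port || force_bind_address_no_port))
		sk->bound_port = port ? port : 32768; /* placeholder autobind */

	if (with_lock)
		release_fake_sock(sk);
	return 0;
}

/* Path taken by the bind entry point: old behavior, takes the lock. */
static int bind_from_syscall(struct fake_sock *sk, unsigned int ip,
			     unsigned short port)
{
	return bind_common(sk, ip, port, false, true);
}

/* Path taken by a helper whose caller already holds the lock. */
static int bind_from_helper(struct fake_sock *sk, unsigned int ip)
{
	return bind_common(sk, ip, 0, true, false);
}
```

Passing the force flag as an argument avoids modifying `bind_address_no_port` on the socket, so there is nothing to restore if the helper path fails.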
    • bpf: Hooks for sys_bind · 4fbac77d
      Committed by Andrey Ignatov
      == The problem ==
      
      There is a use-case when all processes inside a cgroup should use one
      single IP address on a host that has multiple IP configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provide hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
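Implementation note [2] in this commit describes borrowing a third register by spilling it into a scratch field of the context. The sequence can be emulated in userspace with an array acting as the register file. This is a sketch of the idea only: `struct ctx_kern`, `struct user_addr`, `NREGS` and `store_ip4()` are hypothetical, and real instruction rewriting emits instructions rather than executing them.

```c
#include <stdint.h>

/* Hypothetical stand-ins: the field the program may write lives
 * behind a pointer in the context, not in the context itself. */
struct user_addr {
	uint32_t ip4;
};

struct ctx_kern {
	struct user_addr *uaddr;
	uint64_t tmp_reg;	/* scratch slot used as an extra "register" */
};

enum { NREGS = 4 };

/* Emulate "ctx->uaddr->ip4 = regs[src]", where regs[dst] holds the
 * context pointer and must not be modified (src != dst). */
static void store_ip4(uint64_t regs[NREGS], int src, int dst)
{
	struct ctx_kern *ctx = (struct ctx_kern *)(uintptr_t)regs[dst];
	int tmp = 0;

	/* Take the first register that is neither src nor dst. */
	while (tmp == src || tmp == dst)
		tmp++;

	ctx->tmp_reg = regs[tmp];			/* save */
	regs[tmp] = (uint64_t)(uintptr_t)ctx->uaddr;	/* load pointer */
	((struct user_addr *)(uintptr_t)regs[tmp])->ip4 = (uint32_t)regs[src];
	regs[tmp] = ctx->tmp_reg;			/* restore */
}
```

After the call, every register holds the value it had before, so later instructions that depend on them are unaffected.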