1. 04 Jul, 2017 11 commits
  2. 03 Jul, 2017 10 commits
    • D
      openvswitch: fix mis-ordered comment lines for ovs_skb_cb · 52427fa0
      Daniel Axtens committed
      I was trying to wrap my head around meaning of mru, and realised
      that the second line of the comment defining it had somehow
      ended up after the line defining cutlen, leading to much confusion.
      
      Reorder the lines to make sense.
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52427fa0
    • E
      net: make sk_ehashfn() static · 784c372a
      Eric Dumazet committed
      sk_ehashfn() is only used from a single file.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      784c372a
    • E
      net: avoid one splat in fib_nl_delrule() · 5361e209
      Eric Dumazet committed
      We need to use refcount_set() on a newly created rule to avoid
      the following error:
      
      [   64.601749] ------------[ cut here ]------------
      [   64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184 refcount_sub_and_test+0x75/0xa0
      [   64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
      [   64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: G        W       4.12.0-smp-DEV #274
      [   64.601771] task: ffff8837bf482040 task.stack: ffff8837bdc08000
      [   64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
      [   64.601774] RSP: 0018:ffff8837bdc0f5c0 EFLAGS: 00010286
      [   64.601776] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000000
      [   64.601777] RDX: 0000000000000026 RSI: 0000000000000096 RDI: ffffed06f7b81eae
      [   64.601778] RBP: ffff8837bdc0f5d0 R08: 0000000000000004 R09: fffffbfff4a54c25
      [   64.601779] R10: 00000000cbc500e5 R11: ffffffffa52a6128 R12: ffff881febcf6f24
      [   64.601779] R13: ffff881fbf4eaf00 R14: ffff881febcf6f80 R15: ffff8837d7a4ed00
      [   64.601781] FS:  00007ff5a2f6b700(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
      [   64.601782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   64.601783] CR2: 00007ffcdc70d000 CR3: 0000001f9c91e000 CR4: 00000000001406f0
      [   64.601783] Call Trace:
      [   64.601786]  refcount_dec_and_test+0x11/0x20
      [   64.601790]  fib_nl_delrule+0xc39/0x1630
      [   64.601793]  ? is_bpf_text_address+0xe/0x20
      [   64.601795]  ? fib_nl_newrule+0x25e0/0x25e0
      [   64.601798]  ? depot_save_stack+0x133/0x470
      [   64.601801]  ? ns_capable+0x13/0x20
      [   64.601803]  ? __netlink_ns_capable+0xcc/0x100
      [   64.601806]  rtnetlink_rcv_msg+0x23a/0x6a0
      [   64.601808]  ? rtnl_newlink+0x1630/0x1630
      [   64.601811]  ? memset+0x31/0x40
      [   64.601813]  netlink_rcv_skb+0x2d7/0x440
      [   64.601815]  ? rtnl_newlink+0x1630/0x1630
      [   64.601816]  ? netlink_ack+0xaf0/0xaf0
      [   64.601818]  ? kasan_unpoison_shadow+0x35/0x50
      [   64.601820]  ? __kmalloc_node_track_caller+0x4c/0x70
      [   64.601821]  rtnetlink_rcv+0x28/0x30
      [   64.601823]  netlink_unicast+0x422/0x610
      [   64.601824]  ? netlink_attachskb+0x650/0x650
      [   64.601826]  netlink_sendmsg+0x7b7/0xb60
      [   64.601828]  ? netlink_unicast+0x610/0x610
      [   64.601830]  ? netlink_unicast+0x610/0x610
      [   64.601832]  sock_sendmsg+0xba/0xf0
      [   64.601834]  ___sys_sendmsg+0x6a9/0x8c0
      [   64.601835]  ? copy_msghdr_from_user+0x520/0x520
      [   64.601837]  ? __alloc_pages_nodemask+0x160/0x520
      [   64.601839]  ? memcg_write_event_control+0xd60/0xd60
      [   64.601841]  ? __alloc_pages_slowpath+0x1d50/0x1d50
      [   64.601843]  ? kasan_slab_free+0x71/0xc0
      [   64.601845]  ? mem_cgroup_commit_charge+0xb2/0x11d0
      [   64.601847]  ? lru_cache_add_active_or_unevictable+0x7d/0x1a0
      [   64.601849]  ? __handle_mm_fault+0x1af8/0x2810
      [   64.601851]  ? may_open_dev+0xc0/0xc0
      [   64.601852]  ? __pmd_alloc+0x2c0/0x2c0
      [   64.601853]  ? __fdget+0x13/0x20
      [   64.601855]  __sys_sendmsg+0xc6/0x150
      [   64.601856]  ? __sys_sendmsg+0xc6/0x150
      [   64.601857]  ? SyS_shutdown+0x170/0x170
      [   64.601859]  ? handle_mm_fault+0x28a/0x650
      [   64.601861]  SyS_sendmsg+0x12/0x20
      [   64.601863]  entry_SYSCALL_64_fastpath+0x13/0x94
      
      Fixes: 717d1e99 ("net: convert fib_rule.refcnt from atomic_t to refcount_t")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5361e209
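The splat above comes from refcount_t treating a zero count as "already freed". The following is a minimal single-threaded userspace model of that behaviour, not the kernel implementation: the `toy_refcount` type and its functions are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for the kernel's refcount_t (hypothetical). */
struct toy_refcount {
	unsigned int refs;
	bool warned;	/* set where the kernel would emit a WARN splat */
};

/* Initialise a freshly created object: the only valid way to leave zero. */
static void toy_refcount_set(struct toy_refcount *r, unsigned int n)
{
	r->refs = n;
	r->warned = false;
}

/* Taking a reference on a zero count is refused: the object may be freed. */
static void toy_refcount_inc(struct toy_refcount *r)
{
	if (r->refs == 0) {
		r->warned = true;
		return;
	}
	r->refs++;
}

/* Drop one reference; returns true when the caller must free the object. */
static bool toy_refcount_dec_and_test(struct toy_refcount *r)
{
	if (r->refs == 0) {	/* underflow: warn instead of wrapping around */
		r->warned = true;
		return false;
	}
	return --r->refs == 0;
}
```

In this model, an object that is only zero-initialised and then "referenced" with an increment stays at zero, so the later put underflows and warns; starting it with `toy_refcount_set(&r, 1)` avoids that.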
    • A
      net: core: Fix slab-out-of-bounds in netdev_stats_to_stats64 · 9af9959e
      Alban Browaeys committed
      Commit 9256645a ("net/core: relax BUILD_BUG_ON in
      netdev_stats_to_stats64") made it possible to read beyond
      the end of the source.

      Fix this by copying only the size of src to dest, since dest might be
      bigger than src.
      
       ==================================================================
       BUG: KASAN: slab-out-of-bounds in netdev_stats_to_stats64+0xe/0x30 at addr ffff8801be248b20
       Read of size 192 by task VBoxNetAdpCtl/6734
       CPU: 1 PID: 6734 Comm: VBoxNetAdpCtl Tainted: G           O    4.11.4prahal+intel+ #118
       Hardware name: LENOVO 20CDCTO1WW/20CDCTO1WW, BIOS GQET52WW (1.32 ) 05/04/2017
       Call Trace:
        dump_stack+0x63/0x86
        kasan_object_err+0x1c/0x70
        kasan_report+0x270/0x520
        ? netdev_stats_to_stats64+0xe/0x30
        ? sched_clock_cpu+0x1b/0x190
        ? __module_address+0x3e/0x3b0
        ? unwind_next_frame+0x1ea/0xb00
        check_memory_region+0x13c/0x1a0
        memcpy+0x23/0x50
        netdev_stats_to_stats64+0xe/0x30
        dev_get_stats+0x1b9/0x230
        rtnl_fill_stats+0x44/0xc00
        ? nla_put+0xc6/0x130
        rtnl_fill_ifinfo+0xe9e/0x3700
        ? rtnl_fill_vfinfo+0xde0/0xde0
        ? sched_clock+0x9/0x10
        ? sched_clock+0x9/0x10
        ? sched_clock_local+0x120/0x130
        ? __module_address+0x3e/0x3b0
        ? unwind_next_frame+0x1ea/0xb00
        ? sched_clock+0x9/0x10
        ? sched_clock+0x9/0x10
        ? sched_clock_cpu+0x1b/0x190
        ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        ? depot_save_stack+0x1d8/0x4a0
        ? depot_save_stack+0x34f/0x4a0
        ? depot_save_stack+0x34f/0x4a0
        ? save_stack+0xb1/0xd0
        ? save_stack_trace+0x16/0x20
        ? save_stack+0x46/0xd0
        ? kasan_slab_alloc+0x12/0x20
        ? __kmalloc_node_track_caller+0x10d/0x350
        ? __kmalloc_reserve.isra.36+0x2c/0xc0
        ? __alloc_skb+0xd0/0x560
        ? rtmsg_ifinfo_build_skb+0x61/0x120
        ? rtmsg_ifinfo.part.25+0x16/0xb0
        ? rtmsg_ifinfo+0x47/0x70
        ? register_netdev+0x15/0x30
        ? vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
        ? vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
        ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        ? do_vfs_ioctl+0x17f/0xff0
        ? SyS_ioctl+0x74/0x80
        ? do_syscall_64+0x182/0x390
        ? __alloc_skb+0xd0/0x560
        ? __alloc_skb+0xd0/0x560
        ? save_stack_trace+0x16/0x20
        ? init_object+0x64/0xa0
        ? ___slab_alloc+0x1ae/0x5c0
        ? ___slab_alloc+0x1ae/0x5c0
        ? __alloc_skb+0xd0/0x560
        ? sched_clock+0x9/0x10
        ? kasan_unpoison_shadow+0x35/0x50
        ? kasan_kmalloc+0xad/0xe0
        ? __kmalloc_node_track_caller+0x246/0x350
        ? __alloc_skb+0xd0/0x560
        ? kasan_unpoison_shadow+0x35/0x50
        ? memset+0x31/0x40
        ? __alloc_skb+0x31f/0x560
        ? napi_consume_skb+0x320/0x320
        ? br_get_link_af_size_filtered+0xb7/0x120 [bridge]
        ? if_nlmsg_size+0x440/0x630
        rtmsg_ifinfo_build_skb+0x83/0x120
        rtmsg_ifinfo.part.25+0x16/0xb0
        rtmsg_ifinfo+0x47/0x70
        register_netdevice+0xa2b/0xe50
        ? __kmalloc+0x171/0x2d0
        ? netdev_change_features+0x80/0x80
        register_netdev+0x15/0x30
        vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
        vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
        ? vboxNetAdpComposeMACAddress+0x1d0/0x1d0 [vboxnetadp]
        ? kasan_check_write+0x14/0x20
        VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        ? VBoxNetAdpLinuxOpen+0x20/0x20 [vboxnetadp]
        ? lock_acquire+0x11c/0x270
        ? __audit_syscall_entry+0x2fb/0x660
        do_vfs_ioctl+0x17f/0xff0
        ? __audit_syscall_entry+0x2fb/0x660
        ? ioctl_preallocate+0x1d0/0x1d0
        ? __audit_syscall_entry+0x2fb/0x660
        ? kmem_cache_free+0xb2/0x250
        ? syscall_trace_enter+0x537/0xd00
        ? exit_to_usermode_loop+0x100/0x100
        SyS_ioctl+0x74/0x80
        ? do_sys_open+0x350/0x350
        ? do_vfs_ioctl+0xff0/0xff0
        do_syscall_64+0x182/0x390
        entry_SYSCALL64_slow_path+0x25/0x25
       RIP: 0033:0x7f7e39a1ae07
       RSP: 002b:00007ffc6f04c6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
       RAX: ffffffffffffffda RBX: 00007ffc6f04c730 RCX: 00007f7e39a1ae07
       RDX: 00007ffc6f04c730 RSI: 00000000c0207601 RDI: 0000000000000007
       RBP: 00007ffc6f04c700 R08: 00007ffc6f04c780 R09: 0000000000000008
       R10: 0000000000000541 R11: 0000000000000206 R12: 0000000000000007
       R13: 00000000c0207601 R14: 00007ffc6f04c730 R15: 0000000000000012
       Object at ffff8801be248008, in cache kmalloc-4096 size: 4096
       Allocated:
       PID = 6734
        save_stack_trace+0x16/0x20
        save_stack+0x46/0xd0
        kasan_kmalloc+0xad/0xe0
        __kmalloc+0x171/0x2d0
        alloc_netdev_mqs+0x8a7/0xbe0
        vboxNetAdpOsCreate+0x65/0x1c0 [vboxnetadp]
        vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
        VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        do_vfs_ioctl+0x17f/0xff0
        SyS_ioctl+0x74/0x80
        do_syscall_64+0x182/0x390
        return_from_SYSCALL_64+0x0/0x6a
       Freed:
       PID = 5600
        save_stack_trace+0x16/0x20
        save_stack+0x46/0xd0
        kasan_slab_free+0x73/0xc0
        kfree+0xe4/0x220
        kvfree+0x25/0x30
        single_release+0x74/0xb0
        __fput+0x265/0x6b0
        ____fput+0x9/0x10
        task_work_run+0xd5/0x150
        exit_to_usermode_loop+0xe2/0x100
        do_syscall_64+0x26c/0x390
        return_from_SYSCALL_64+0x0/0x6a
       Memory state around the buggy address:
        ffff8801be248a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff8801be248b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       >ffff8801be248b80: 00 00 00 00 00 00 00 00 00 00 00 07 fc fc fc fc
                                                           ^
        ffff8801be248c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff8801be248c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ==================================================================
      Signed-off-by: Alban Browaeys <alban.browaeys@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9af9959e
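The idea of the fix, copying only as many bytes as the source holds and clearing the rest of the larger destination, can be sketched in userspace as follows. The `toy_stats` and `toy_stats64` structs are hypothetical stand-ins, and the sketch assumes the source fields form a prefix of the destination layout.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins: the source struct is smaller than the destination. */
struct toy_stats   { unsigned long long rx_packets, tx_packets; };
struct toy_stats64 { unsigned long long rx_packets, tx_packets, rx_nohandler; };

/* The destination may be larger than the source, but never smaller. */
_Static_assert(sizeof(struct toy_stats) <= sizeof(struct toy_stats64),
	       "source must fit into destination");

static void toy_stats_to_stats64(struct toy_stats64 *dst,
				 const struct toy_stats *src)
{
	/* Copy only what the source holds, never sizeof(*dst) bytes. */
	memcpy(dst, src, sizeof(*src));
	/* Zero the tail of the destination so no stale data is left behind. */
	memset((char *)dst + sizeof(*src), 0, sizeof(*dst) - sizeof(*src));
}
```

Reading `sizeof(*dst)` bytes from `src` instead would run past the end of the source object, which is the kind of out-of-bounds read KASAN reports above.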
    • D
      bpf: simplify narrower ctx access · f96da094
      Daniel Borkmann committed
      This work tries to make the semantics and code around the
      narrower ctx access a bit easier to follow. Right now
      everything is done inside the .is_valid_access(). Offset
      matching is done differently for read/write types, meaning
      writes don't support narrower access and thus matching only
      on offsetof(struct foo, bar) is enough whereas for read
      case that supports narrower access we must check for
      offsetof(struct foo, bar) ... offsetof(struct foo, bar) +
      sizeof(<bar>) - 1 for each of the cases. For read cases of
      individual members that don't support narrower access (like
      packet pointers or skb->cb[] case which has its own narrow
      access logic), we check as usual only offsetof(struct foo,
      bar) like in write case. Then, for the case where narrower
      access is allowed, we also need to set the aux info for the
      access. Meaning, ctx_field_size and converted_op_size have
      to be set. First is the original field size e.g. sizeof(<bar>)
      as in above example from the user facing ctx, and latter
      one is the target size after actual rewrite happened, thus
      for the kernel facing ctx. Also here we need the range match
      and we need to keep track changing convert_ctx_access() and
      converted_op_size from is_valid_access() as both are not at
      the same location.
      
      We can simplify the code a bit: check_ctx_access() becomes
      simpler in that we only store ctx_field_size as a meta data
      and later in convert_ctx_accesses() we fetch the target_size
      right from the location where we do convert. Should the verifier
      be misconfigured we do reject for BPF_WRITE cases or target_size
      that are not provided. For the subsystems, we always work on
      ranges in is_valid_access() and add small helpers for ranges
      and narrow access, convert_ctx_accesses() sets target_size
      for the relevant instruction.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f96da094
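The range matching described above (reads may hit any suitably sized slice of a field, writes must match the whole field) can be modelled in userspace as below. This is a simplified sketch, not the verifier's actual rule set: `toy_ctx` and the `toy_*` helpers are hypothetical, and the alignment rule used here is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-facing context with 4-byte fields. */
struct toy_ctx {
	unsigned int len;
	unsigned int mark;
	unsigned int proto;
};

/* True if [off, off + size) lies inside the field and is a sane narrow load. */
static bool toy_narrow_access_ok(size_t off, size_t size,
				 size_t field_off, size_t field_size)
{
	if (size == 0 || (size & (size - 1)))	/* power-of-two sizes only */
		return false;
	if (off < field_off || off + size > field_off + field_size)
		return false;
	return off % size == 0;			/* naturally aligned */
}

/* Reads of 'mark' may be narrower than the field. */
static bool toy_mark_read_ok(size_t off, size_t size)
{
	return toy_narrow_access_ok(off, size, offsetof(struct toy_ctx, mark),
				    sizeof(((struct toy_ctx *)0)->mark));
}

/* Writes must match the field's offset and full size exactly. */
static bool toy_mark_write_ok(size_t off, size_t size)
{
	return off == offsetof(struct toy_ctx, mark) &&
	       size == sizeof(((struct toy_ctx *)0)->mark);
}
```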
    • D
      bpf: add bpf_skb_adjust_room helper · 2be7e212
      Daniel Borkmann committed
      This work adds a helper that can be used to adjust net room of an
      skb. The helper is generic and can be further extended in future.
      Main use case is for having a programmatic way to add/remove room to
      v4/v6 header options along with cls_bpf on egress and ingress hook
      of the data path. It reuses most of the infrastructure that we added
      for the bpf_skb_change_proto() helper which can be used in nat64
      translations. Similarly, the helper only takes care of adjusting the
      room so that related data is populated and csum adapted out of the
      BPF program using it.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2be7e212
    • D
      bpf, net: add skb_mac_header_len helper · 0daf4349
      Daniel Borkmann committed
      Add a small skb_mac_header_len() helper, similar to the
      skb_network_header_len() we already have, and replace open-coded
      places in BPF's bpf_skb_change_proto() helper. It will also
      be used in upcoming work.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0daf4349
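A helper of this kind is just the distance between two header offsets. The sketch below models it in userspace; `toy_skb` is a hypothetical, heavily reduced stand-in for struct sk_buff, and the assumption is that header positions are stored as offsets from the start of the buffer.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for struct sk_buff: each header position
 * is an offset from the start of the packet buffer. */
struct toy_skb {
	unsigned short mac_header;
	unsigned short network_header;
	unsigned short transport_header;
};

/* Length of the link-layer header: from MAC header to network header. */
static unsigned int toy_skb_mac_header_len(const struct toy_skb *skb)
{
	return skb->network_header - skb->mac_header;
}

/* Length of the network header: from network header to transport header. */
static unsigned int toy_skb_network_header_len(const struct toy_skb *skb)
{
	return skb->transport_header - skb->network_header;
}
```

For an Ethernet frame carrying an IPv4 header without options, offsets of 0, 14 and 34 give a MAC header length of 14 and a network header length of 20.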
    • S
      ipv6: dad: don't remove dynamic addresses if link is down · ec8add2a
      Sabrina Dubroca committed
      Currently, when the link for $DEV is down, this command succeeds but the
      address is removed immediately by DAD (1):
      
          ip addr add 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
      
      In the same situation, this will succeed and not remove the address (2):
      
          ip addr add 1111::12/64 dev $DEV
          ip addr change 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
      
      The comment in addrconf_dad_begin() when !IF_READY makes it look like
      this is the intended behavior, but doesn't explain why:
      
           * If the device is not ready:
           * - keep it tentative if it is a permanent address.
           * - otherwise, kill it.
      
      We clearly cannot prevent userspace from doing (2), but we can make (1)
      work consistently with (2).
      
      addrconf_dad_stop() is only called in two cases: if DAD failed, or to
      skip DAD when the link is down. In that second case, the fix is to avoid
      deleting the address, like we already do for permanent addresses.
      
      Fixes: 3c21edbd ("[IPV6]: Defer IPv6 device initialization until the link becomes ready.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec8add2a
    • L
      bpf: fix to bpf_setsockops · a5192c52
      Lawrence Brakmo committed
      Fixed a build error due to a misplaced "#ifdef CONFIG_INET" (moved it
      up by one statement).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5192c52
  3. 02 Jul, 2017 19 commits
    • L
      bpf: Adds support for setting sndcwnd clamp · 13bf9641
      Lawrence Brakmo committed
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
      sets the clamp on the send congestion window. It is useful to limit the
      sndcwnd when the hosts are close to each other (small RTT).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13bf9641
    • L
      bpf: Adds support for setting initial cwnd · fc747810
      Lawrence Brakmo committed
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
      initial congestion window. This can be used when the hosts are far
      apart (large RTTs) and it is safe to start with a large initial cwnd.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc747810
    • L
      bpf: Add support for changing congestion control · 91b5b21c
      Lawrence Brakmo committed
      Added support for changing congestion control for SOCK_OPS bpf
      programs through the setsockopt bpf helper function. It also adds
      a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
      congestion controls, like dctcp, that need to enable ECN in the
      SYN packets.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91b5b21c
    • L
      bpf: Add TCP connection BPF callbacks · 9872a4bd
      Lawrence Brakmo committed
      Added callbacks to BPF SOCK_OPS type programs before an active
      connection is initialized and after a passive or active connection is
      established.

      The following patch demonstrates how they can be used to set send and
      receive buffer sizes.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9872a4bd
    • L
      bpf: Add setsockopt helper function to bpf · 8c4b4c7e
      Lawrence Brakmo committed
      Added support for calling a subset of socket setsockopts from
      BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
      than making the changes to call the socket setsockopt function because
      the changes required would have been larger.
      
      The ops supported are:
        SO_RCVBUF
        SO_SNDBUF
        SO_MAX_PACING_RATE
        SO_PRIORITY
        SO_RCVLOWAT
        SO_MARK
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c4b4c7e
    • L
      bpf: Support for setting initial receive window · 13d3b1eb
      Lawrence Brakmo committed
      This patch adds support for setting the initial advertised window from
      within a BPF_SOCK_OPS program. This can be used to support larger
      initial cwnd values in environments where it is known to be safe.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13d3b1eb
    • L
      bpf: Support for per connection SYN/SYN-ACK RTOs · 8550f328
      Lawrence Brakmo committed
      This patch adds support for setting per-connection SYN and
      SYN-ACK RTOs from within a BPF_SOCK_OPS program. For example,
      it can set small RTOs when it is known that both hosts are within
      a datacenter.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8550f328
    • L
      bpf: BPF support for sock_ops · 40304b2a
      Lawrence Brakmo committed
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connections parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has an "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use Facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type, following patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sock_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40304b2a
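The "first type" of op described above, where the program returns a value for the caller or a negative value when the op is not supported, can be modelled in plain userspace C. This is not a loadable BPF program: the `toy_*` names, the op codes, the "same /16 means same datacenter" rule and the returned timeout value are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical op codes for this sketch; the kernel defines its own set. */
enum toy_sock_op {
	TOY_OP_TIMEOUT_INIT = 1,
	TOY_OP_UNKNOWN = 99,
};

/* Reduced copy of the user-facing struct quoted above. */
struct toy_sock_ops {
	uint32_t op;
	uint32_t family;
	uint32_t remote_ip4;	/* network byte order */
	uint32_t local_ip4;	/* network byte order */
};

/* Assumption: hosts sharing the first two address bytes are in the same
 * datacenter. In network byte order those are the first bytes in memory,
 * so memcmp() works regardless of host endianness. */
static int toy_same_dc(const struct toy_sock_ops *ctx)
{
	return memcmp(&ctx->remote_ip4, &ctx->local_ip4, 2) == 0;
}

/* Returns a value for the caller to use, or -1 for "not supported". */
static int toy_sock_ops_prog(const struct toy_sock_ops *ctx)
{
	switch (ctx->op) {
	case TOY_OP_TIMEOUT_INIT:
		/* Pick a small SYN RTO (arbitrary units) inside a datacenter. */
		return toy_same_dc(ctx) ? 10 : -1;
	default:
		return -1;
	}
}
```

Dispatching on `op` in a single entry point is what lets one program be called from several places in the network stack, as the commit message describes.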
    • N
      sctp: Add peeloff-flags socket option · 2cb5c8e3
      Neil Horman committed
      Based on a request raised on the sctp devel list, there is a need to
      augment the sctp_peeloff operation while specifying the O_CLOEXEC and
      O_NONBLOCK flags (similar to the socket syscall).  Since modifying the
      SCTP_SOCKOPT_PEELOFF socket option would break user space ABI for existing
      programs, this patch creates a new socket option
      SCTP_SOCKOPT_PEELOFF_FLAGS, which accepts a third flags parameter to
      allow atomic assignment of the socket descriptor flags.
      
      Tested successfully by myself and the requestor.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Andreas Steinmetz <ast@domdv.de>
      CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2cb5c8e3
    • T
      datapath: Avoid using stack larger than 1024. · 9cc9a5cb
      Tonghao Zhang committed
      When compiling OvS-master on 4.4.0-81 kernel,
      there is a warning:
      
          CC [M]  /root/ovs/datapath/linux/datapath.o
          /root/ovs/datapath/linux/datapath.c: In function
          'ovs_flow_cmd_set':
          /root/ovs/datapath/linux/datapath.c:1221:1: warning:
          the frame size of 1040 bytes is larger than 1024 bytes
          [-Wframe-larger-than=]
      
      This patch factors out match-init and action-copy to avoid the
      "-Wframe-larger-than=1024" warning. Because mask is only
      used to get actions, we add a new function to save some
      stack space.
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9cc9a5cb
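The general technique, moving a large short-lived local into its own non-inlined helper so that it no longer counts toward the caller's frame, can be sketched as below. The `toy_*` names and the 512-byte size are hypothetical, and `__attribute__((noinline))` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_KEY_SIZE 512	/* hypothetical size of one large on-stack object */

struct toy_key {
	unsigned char bytes[TOY_KEY_SIZE];
};

/* noinline keeps this helper's frame separate from the caller's. */
static __attribute__((noinline)) unsigned int
toy_mask_checksum(const unsigned char *in, size_t len)
{
	struct toy_key mask;	/* on the stack only while this helper runs */
	unsigned int sum = 0;
	size_t i;

	memset(&mask, 0, sizeof(mask));
	memcpy(mask.bytes, in, len < sizeof(mask.bytes) ? len : sizeof(mask.bytes));
	for (i = 0; i < sizeof(mask.bytes); i++)
		sum += mask.bytes[i];
	return sum;
}

/* The caller now holds one large object instead of two. */
static unsigned int toy_flow_set(const unsigned char *in, size_t len)
{
	struct toy_key key;

	memset(&key, 0, sizeof(key));
	memcpy(key.bytes, in, len < sizeof(key.bytes) ? len : sizeof(key.bytes));
	return key.bytes[0] + toy_mask_checksum(in, len);
}
```

Had `mask` been declared in `toy_flow_set()` next to `key`, both buffers would count toward a single frame, which is the situation that -Wframe-larger-than flags.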
    • X
      sctp: remove the typedef sctp_init_chunk_t · 01a992be
      Xin Long committed
      This patch removes the typedef sctp_init_chunk_t and replaces it
      with struct sctp_init_chunk wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01a992be
    • X
      sctp: remove the typedef sctp_inithdr_t · 4ae70c08
      Xin Long committed
      This patch removes the typedef sctp_inithdr_t and replaces it
      with struct sctp_inithdr wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ae70c08
    • X
      sctp: remove the typedef sctp_data_chunk_t · 9f8d3147
      Xin Long committed
      This patch removes the typedef sctp_data_chunk_t and replaces it
      with struct sctp_data_chunk wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f8d3147
    • X
      sctp: remove the typedef sctp_datahdr_t · 3583df1a
      Xin Long committed
      This patch removes the typedef sctp_datahdr_t and replaces it with
      struct sctp_datahdr wherever the typedef is used.

      It also uses sizeof(variable) instead of sizeof(type).
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3583df1a
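The two conventions applied throughout this series, naming the struct directly instead of hiding it behind a typedef and taking sizeof of the variable rather than the type, look roughly like this. The `toy_datahdr` struct and its fields are hypothetical and only loosely shaped like a data chunk header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Before (hypothetical): typedef struct toy_datahdr { ... } toy_datahdr_t;
 * After: only the struct tag remains, so every use spells out "struct". */
struct toy_datahdr {
	uint32_t tsn;
	uint16_t stream;
	uint16_t ssn;
	uint32_t ppid;
};

static void toy_datahdr_init(struct toy_datahdr *hdr, uint32_t tsn)
{
	/* sizeof(*hdr) tracks the variable's type automatically, whereas
	 * sizeof(struct toy_datahdr) must be updated by hand if it changes. */
	memset(hdr, 0, sizeof(*hdr));
	hdr->tsn = tsn;
}
```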
    • X
      sctp: remove the typedef sctp_param_t · 34b4e29b
      Xin Long committed
      This patch removes the typedef sctp_param_t and replaces it with
      struct sctp_paramhdr wherever the typedef is used.

      It also removes the useless declaration sctp_addip_addr_config
      and adds the missing parameters to some other function declarations.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34b4e29b
    • X
      sctp: remove the typedef sctp_paramhdr_t · 3c918704
      Xin Long committed
      This patch removes the typedef sctp_paramhdr_t and replaces it
      with struct sctp_paramhdr wherever the typedef is used.

      It also fixes some indents and uses sizeof(variable) instead
      of sizeof(type).
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c918704
    • X
      sctp: remove the typedef sctp_cid_t · 6d85e68f
      Xin Long committed
      This patch removes the typedef sctp_cid_t and replaces it
      with enum sctp_cid wherever the typedef is used.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d85e68f
    • X
      sctp: remove the typedef sctp_chunkhdr_t · 922dbc5b
      Xin Long committed
      This patch removes the typedef sctp_chunkhdr_t and replaces it
      with struct sctp_chunkhdr wherever the typedef is used.

      It also fixes some indents and uses sizeof(variable) instead
      of sizeof(type), especially in sctp_new.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      922dbc5b
    • X
      sctp: remove the typedef sctp_sctphdr_t · ae146d9b
      Xin Long committed
      This patch removes the typedef sctp_sctphdr_t and replaces it
      with struct sctphdr wherever the typedef is used.

      It also fixes some indents and uses sizeof(variable) instead
      of sizeof(type).
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae146d9b