1. 13 Jul 2017 (1 commit)
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Authored by Michal Hocko
      __GFP_REPEAT was designed to allow retry-but-eventually-fail
      semantics in the page allocator.  This has been true, but only for
      allocation requests larger than PAGE_ALLOC_COSTLY_ORDER; it has
      always been ignored for smaller sizes.  This is a bit unfortunate
      because there is no way to express the same semantics for those
      requests, and they are considered too important to fail, so they
      might end up looping in the page allocator forever, similarly to
      GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or
      misguided usage of the __GFP_REPEAT flag has been removed for
      !costly requests, we can give the original flag a better name and,
      more importantly, more useful semantics.  Let's rename it to
      __GFP_RETRY_MAYFAIL, which tells the user that the allocator will
      try really hard but there is no promise of success.  This works
      independently of the order and overrides the default allocator
      behavior.  Page allocator users have several levels of guarantee
      vs. cost options (take GFP_KERNEL as an example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit
         the more aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context, but it can wake kswapd to reclaim memory if the zone is
         below the low watermark. Can be used from either atomic contexts
         or when the request is a performance optimization and there is
         another fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can
         access some portion of the memory reserves. Usually used from
         interrupt/bh context with an expensive slow-path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because this matches the semantics they have relied on all along.
      No new users are added.  __alloc_pages_slowpath is changed to bail
      out for __GFP_RETRY_MAYFAIL if there is no progress and we have
      already passed the OOM point.

      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than retrying forever in the
      page allocator.
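
      As a rough illustration (not taken from the patch; the function
      name and fallback policy are hypothetical), the typical
      __GFP_RETRY_MAYFAIL pattern pairs a hard-trying contiguous
      allocation with a caller-defined fallback:

      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      static void *example_alloc_table(size_t size)
      {
              void *table;

              /* Try hard for physically contiguous memory, but let the
               * request fail instead of invoking the OOM killer. */
              table = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
              if (!table)
                      table = vmalloc(size);  /* user-defined fallback */
              return table;
      }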
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 08 Jul 2017 (1 commit)
    • bonding: avoid NETDEV_CHANGEMTU event when unregistering slave · f51048c3
      Authored by WANG Cong
      As Hongjun/Nicolas summarized in their original patch:
      
      "
      When a device changes from one netns to another, it's first unregistered,
      then the netns reference is updated and the dev is registered in the new
      netns. Thus, when a slave moves to another netns, it is first
      unregistered. This triggers a NETDEV_UNREGISTER event which is caught by
      the bonding driver. The driver calls bond_release(), which calls
      dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in
      the old netns).
      "
      
      This is a very special case: because the device is being
      unregistered, no one should still care about the NETDEV_CHANGEMTU
      event triggered at this point.  We can therefore avoid broadcasting
      this event on this path, and avoid touching the
      inetdev_event()/addrconf_notify() paths.

      This requires exporting __dev_set_mtu() to the bonding driver.
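
      A rough sketch of the idea (simplified; the exact condition and
      call site in the patch may differ slightly):

      /* in __bond_release_one(), when restoring the slave's MTU */
      if (unregister)
              /* no NETDEV_CHANGEMTU notification */
              __dev_set_mtu(slave_dev, slave->original_mtu);
      else
              dev_set_mtu(slave_dev, slave->original_mtu);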
      Reported-by: Hongjun Li <hongjun.li@6wind.com>
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 05 Jul 2017 (1 commit)
  4. 03 Jul 2017 (6 commits)
    • net: avoid one splat in fib_nl_delrule() · 5361e209
      Authored by Eric Dumazet
      We need to use refcount_set() on a newly created rule to avoid the
      following error:
      
      [   64.601749] ------------[ cut here ]------------
      [   64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184 refcount_sub_and_test+0x75/0xa0
      [   64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
      [   64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: G        W       4.12.0-smp-DEV #274
      [   64.601771] task: ffff8837bf482040 task.stack: ffff8837bdc08000
      [   64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
      [   64.601774] RSP: 0018:ffff8837bdc0f5c0 EFLAGS: 00010286
      [   64.601776] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000000
      [   64.601777] RDX: 0000000000000026 RSI: 0000000000000096 RDI: ffffed06f7b81eae
      [   64.601778] RBP: ffff8837bdc0f5d0 R08: 0000000000000004 R09: fffffbfff4a54c25
      [   64.601779] R10: 00000000cbc500e5 R11: ffffffffa52a6128 R12: ffff881febcf6f24
      [   64.601779] R13: ffff881fbf4eaf00 R14: ffff881febcf6f80 R15: ffff8837d7a4ed00
      [   64.601781] FS:  00007ff5a2f6b700(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
      [   64.601782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   64.601783] CR2: 00007ffcdc70d000 CR3: 0000001f9c91e000 CR4: 00000000001406f0
      [   64.601783] Call Trace:
      [   64.601786]  refcount_dec_and_test+0x11/0x20
      [   64.601790]  fib_nl_delrule+0xc39/0x1630
      [   64.601793]  ? is_bpf_text_address+0xe/0x20
      [   64.601795]  ? fib_nl_newrule+0x25e0/0x25e0
      [   64.601798]  ? depot_save_stack+0x133/0x470
      [   64.601801]  ? ns_capable+0x13/0x20
      [   64.601803]  ? __netlink_ns_capable+0xcc/0x100
      [   64.601806]  rtnetlink_rcv_msg+0x23a/0x6a0
      [   64.601808]  ? rtnl_newlink+0x1630/0x1630
      [   64.601811]  ? memset+0x31/0x40
      [   64.601813]  netlink_rcv_skb+0x2d7/0x440
      [   64.601815]  ? rtnl_newlink+0x1630/0x1630
      [   64.601816]  ? netlink_ack+0xaf0/0xaf0
      [   64.601818]  ? kasan_unpoison_shadow+0x35/0x50
      [   64.601820]  ? __kmalloc_node_track_caller+0x4c/0x70
      [   64.601821]  rtnetlink_rcv+0x28/0x30
      [   64.601823]  netlink_unicast+0x422/0x610
      [   64.601824]  ? netlink_attachskb+0x650/0x650
      [   64.601826]  netlink_sendmsg+0x7b7/0xb60
      [   64.601828]  ? netlink_unicast+0x610/0x610
      [   64.601830]  ? netlink_unicast+0x610/0x610
      [   64.601832]  sock_sendmsg+0xba/0xf0
      [   64.601834]  ___sys_sendmsg+0x6a9/0x8c0
      [   64.601835]  ? copy_msghdr_from_user+0x520/0x520
      [   64.601837]  ? __alloc_pages_nodemask+0x160/0x520
      [   64.601839]  ? memcg_write_event_control+0xd60/0xd60
      [   64.601841]  ? __alloc_pages_slowpath+0x1d50/0x1d50
      [   64.601843]  ? kasan_slab_free+0x71/0xc0
      [   64.601845]  ? mem_cgroup_commit_charge+0xb2/0x11d0
      [   64.601847]  ? lru_cache_add_active_or_unevictable+0x7d/0x1a0
      [   64.601849]  ? __handle_mm_fault+0x1af8/0x2810
      [   64.601851]  ? may_open_dev+0xc0/0xc0
      [   64.601852]  ? __pmd_alloc+0x2c0/0x2c0
      [   64.601853]  ? __fdget+0x13/0x20
      [   64.601855]  __sys_sendmsg+0xc6/0x150
      [   64.601856]  ? __sys_sendmsg+0xc6/0x150
      [   64.601857]  ? SyS_shutdown+0x170/0x170
      [   64.601859]  ? handle_mm_fault+0x28a/0x650
      [   64.601861]  SyS_sendmsg+0x12/0x20
      [   64.601863]  entry_SYSCALL_64_fastpath+0x13/0x94
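
      For reference, a minimal sketch of the fix described above:
      initialize the refcount when the rule is created rather than
      incrementing it from zero:

      /* in fib_nl_newrule(), once the rule has been allocated */
      refcount_set(&rule->refcnt, 1);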
      
      Fixes: 717d1e99 ("net: convert fib_rule.refcnt from atomic_t to refcount_t")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: Fix slab-out-of-bounds in netdev_stats_to_stats64 · 9af9959e
      Authored by Alban Browaeys
      commit 9256645a ("net/core: relax BUILD_BUG_ON in
      netdev_stats_to_stats64") made it possible to attempt a read beyond
      the size of the source.

      Fix by copying only src's size to dest, as dest might be bigger
      than src.
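
      A sketch of the fix as described (modeled on the description, not a
      verbatim copy of the patch): copy only the counters present in the
      smaller source and zero the remainder of the destination:

      void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64,
                                   const struct net_device_stats *netdev_stats)
      {
              size_t i, n = sizeof(*netdev_stats) / sizeof(unsigned long);
              const unsigned long *src = (const unsigned long *)netdev_stats;
              u64 *dst = (u64 *)stats64;

              BUILD_BUG_ON(sizeof(*stats64) < sizeof(*netdev_stats));
              for (i = 0; i < n; i++)
                      dst[i] = src[i];
              /* zero out counters that exist only in rtnl_link_stats64 */
              memset((char *)stats64 + n * sizeof(u64), 0,
                     sizeof(*stats64) - n * sizeof(u64));
      }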
      
       ==================================================================
       BUG: KASAN: slab-out-of-bounds in netdev_stats_to_stats64+0xe/0x30 at addr ffff8801be248b20
       Read of size 192 by task VBoxNetAdpCtl/6734
       CPU: 1 PID: 6734 Comm: VBoxNetAdpCtl Tainted: G           O    4.11.4prahal+intel+ #118
       Hardware name: LENOVO 20CDCTO1WW/20CDCTO1WW, BIOS GQET52WW (1.32 ) 05/04/2017
       Call Trace:
        dump_stack+0x63/0x86
        kasan_object_err+0x1c/0x70
        kasan_report+0x270/0x520
        ? netdev_stats_to_stats64+0xe/0x30
        ? sched_clock_cpu+0x1b/0x190
        ? __module_address+0x3e/0x3b0
        ? unwind_next_frame+0x1ea/0xb00
        check_memory_region+0x13c/0x1a0
        memcpy+0x23/0x50
        netdev_stats_to_stats64+0xe/0x30
        dev_get_stats+0x1b9/0x230
        rtnl_fill_stats+0x44/0xc00
        ? nla_put+0xc6/0x130
        rtnl_fill_ifinfo+0xe9e/0x3700
        ? rtnl_fill_vfinfo+0xde0/0xde0
        ? sched_clock+0x9/0x10
        ? sched_clock+0x9/0x10
        ? sched_clock_local+0x120/0x130
        ? __module_address+0x3e/0x3b0
        ? unwind_next_frame+0x1ea/0xb00
        ? sched_clock+0x9/0x10
        ? sched_clock+0x9/0x10
        ? sched_clock_cpu+0x1b/0x190
        ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        ? depot_save_stack+0x1d8/0x4a0
        ? depot_save_stack+0x34f/0x4a0
        ? depot_save_stack+0x34f/0x4a0
        ? save_stack+0xb1/0xd0
        ? save_stack_trace+0x16/0x20
        ? save_stack+0x46/0xd0
        ? kasan_slab_alloc+0x12/0x20
        ? __kmalloc_node_track_caller+0x10d/0x350
        ? __kmalloc_reserve.isra.36+0x2c/0xc0
        ? __alloc_skb+0xd0/0x560
        ? rtmsg_ifinfo_build_skb+0x61/0x120
        ? rtmsg_ifinfo.part.25+0x16/0xb0
        ? rtmsg_ifinfo+0x47/0x70
        ? register_netdev+0x15/0x30
        ? vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
        ? vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
        ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        ? do_vfs_ioctl+0x17f/0xff0
        ? SyS_ioctl+0x74/0x80
        ? do_syscall_64+0x182/0x390
        ? __alloc_skb+0xd0/0x560
        ? __alloc_skb+0xd0/0x560
        ? save_stack_trace+0x16/0x20
        ? init_object+0x64/0xa0
        ? ___slab_alloc+0x1ae/0x5c0
        ? ___slab_alloc+0x1ae/0x5c0
        ? __alloc_skb+0xd0/0x560
        ? sched_clock+0x9/0x10
        ? kasan_unpoison_shadow+0x35/0x50
        ? kasan_kmalloc+0xad/0xe0
        ? __kmalloc_node_track_caller+0x246/0x350
        ? __alloc_skb+0xd0/0x560
        ? kasan_unpoison_shadow+0x35/0x50
        ? memset+0x31/0x40
        ? __alloc_skb+0x31f/0x560
        ? napi_consume_skb+0x320/0x320
        ? br_get_link_af_size_filtered+0xb7/0x120 [bridge]
        ? if_nlmsg_size+0x440/0x630
        rtmsg_ifinfo_build_skb+0x83/0x120
        rtmsg_ifinfo.part.25+0x16/0xb0
        rtmsg_ifinfo+0x47/0x70
        register_netdevice+0xa2b/0xe50
        ? __kmalloc+0x171/0x2d0
        ? netdev_change_features+0x80/0x80
        register_netdev+0x15/0x30
        vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
        vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
        ? vboxNetAdpComposeMACAddress+0x1d0/0x1d0 [vboxnetadp]
        ? kasan_check_write+0x14/0x20
        VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        ? VBoxNetAdpLinuxOpen+0x20/0x20 [vboxnetadp]
        ? lock_acquire+0x11c/0x270
        ? __audit_syscall_entry+0x2fb/0x660
        do_vfs_ioctl+0x17f/0xff0
        ? __audit_syscall_entry+0x2fb/0x660
        ? ioctl_preallocate+0x1d0/0x1d0
        ? __audit_syscall_entry+0x2fb/0x660
        ? kmem_cache_free+0xb2/0x250
        ? syscall_trace_enter+0x537/0xd00
        ? exit_to_usermode_loop+0x100/0x100
        SyS_ioctl+0x74/0x80
        ? do_sys_open+0x350/0x350
        ? do_vfs_ioctl+0xff0/0xff0
        do_syscall_64+0x182/0x390
        entry_SYSCALL64_slow_path+0x25/0x25
       RIP: 0033:0x7f7e39a1ae07
       RSP: 002b:00007ffc6f04c6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
       RAX: ffffffffffffffda RBX: 00007ffc6f04c730 RCX: 00007f7e39a1ae07
       RDX: 00007ffc6f04c730 RSI: 00000000c0207601 RDI: 0000000000000007
       RBP: 00007ffc6f04c700 R08: 00007ffc6f04c780 R09: 0000000000000008
       R10: 0000000000000541 R11: 0000000000000206 R12: 0000000000000007
       R13: 00000000c0207601 R14: 00007ffc6f04c730 R15: 0000000000000012
       Object at ffff8801be248008, in cache kmalloc-4096 size: 4096
       Allocated:
       PID = 6734
        save_stack_trace+0x16/0x20
        save_stack+0x46/0xd0
        kasan_kmalloc+0xad/0xe0
        __kmalloc+0x171/0x2d0
        alloc_netdev_mqs+0x8a7/0xbe0
        vboxNetAdpOsCreate+0x65/0x1c0 [vboxnetadp]
        vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
        VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
        do_vfs_ioctl+0x17f/0xff0
        SyS_ioctl+0x74/0x80
        do_syscall_64+0x182/0x390
        return_from_SYSCALL_64+0x0/0x6a
       Freed:
       PID = 5600
        save_stack_trace+0x16/0x20
        save_stack+0x46/0xd0
        kasan_slab_free+0x73/0xc0
        kfree+0xe4/0x220
        kvfree+0x25/0x30
        single_release+0x74/0xb0
        __fput+0x265/0x6b0
        ____fput+0x9/0x10
        task_work_run+0xd5/0x150
        exit_to_usermode_loop+0xe2/0x100
        do_syscall_64+0x26c/0x390
        return_from_SYSCALL_64+0x0/0x6a
       Memory state around the buggy address:
        ffff8801be248a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff8801be248b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       >ffff8801be248b80: 00 00 00 00 00 00 00 00 00 00 00 07 fc fc fc fc
                                                           ^
        ffff8801be248c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff8801be248c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ==================================================================
      Signed-off-by: Alban Browaeys <alban.browaeys@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: simplify narrower ctx access · f96da094
      Authored by Daniel Borkmann
      This work tries to make the semantics and the code around narrower
      ctx access a bit easier to follow. Right now everything is done
      inside .is_valid_access(). Offset matching is done differently for
      read/write types: writes don't support narrower access, so matching
      only on offsetof(struct foo, bar) is enough, whereas for reads that
      support narrower access we must check the whole range from
      offsetof(struct foo, bar) to offsetof(struct foo, bar) +
      sizeof(<bar>) - 1 for each of the cases. For reads of individual
      members that don't support narrower access (like packet pointers or
      the skb->cb[] case, which has its own narrow-access logic), we
      check as usual only offsetof(struct foo, bar), as in the write
      case. Then, where narrower access is allowed, we also need to set
      the aux info for the access, meaning ctx_field_size and
      converted_op_size have to be set. The first is the original field
      size, e.g. sizeof(<bar>) as in the example above, from the
      user-facing ctx; the latter is the target size after the actual
      rewrite happened, thus for the kernel-facing ctx. Also, here we
      need the range match, and we need to keep track of changing
      convert_ctx_access() and converted_op_size from is_valid_access(),
      as the two are not at the same location.
      
      We can simplify the code a bit: check_ctx_access() becomes simpler
      in that we only store ctx_field_size as metadata, and later in
      convert_ctx_accesses() we fetch the target_size right from the
      location where we do the conversion. Should the verifier be
      misconfigured, we reject BPF_WRITE cases or a target_size that is
      not provided. For the subsystems, we always work on ranges in
      is_valid_access() and add small helpers for ranges and narrow
      access; convert_ctx_accesses() sets target_size for the relevant
      instruction.
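
      A hypothetical helper capturing the range-match idea (the name and
      exact form are illustrative, not necessarily what the patch adds):

      /* A narrow read of @size bytes at @off is acceptable iff it lies
       * entirely within the user-facing field
       * [field_off, field_off + field_size). */
      static bool ctx_access_in_range(u32 off, u32 size,
                                      u32 field_off, u32 field_size)
      {
              return off >= field_off && off + size <= field_off + field_size;
      }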
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add bpf_skb_adjust_room helper · 2be7e212
      Authored by Daniel Borkmann
      This work adds a helper that can be used to adjust the net room of
      an skb. The helper is generic and can be further extended in the
      future. The main use case is having a programmatic way to add or
      remove room for v4/v6 header options, along with cls_bpf, on the
      egress and ingress hooks of the data path. It reuses most of the
      infrastructure that we added for the bpf_skb_change_proto() helper,
      which can be used in nat64 translations. Similarly, the helper only
      takes care of adjusting the room; populating the related data and
      adapting the csum is done out of the BPF program using it.
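
      A hypothetical cls_bpf fragment using the helper; the 8-byte delta,
      zero flags value, and error policy are illustrative only:

      /* grow the network-layer room by 8 bytes, e.g. to make space for
       * an inserted v4/v6 header option */
      int ret = bpf_skb_adjust_room(skb, 8, BPF_ADJ_ROOM_NET, 0);
      if (ret < 0)
              return TC_ACT_SHOT;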
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, net: add skb_mac_header_len helper · 0daf4349
      Authored by Daniel Borkmann
      Add a small skb_mac_header_len() helper, similar to the
      skb_network_header_len() we already have, and replace open-coded
      places in BPF's bpf_skb_change_proto() helper.  It will also be
      used in upcoming work.
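
      Presumably the helper is the obvious mirror of
      skb_network_header_len(); a sketch:

      static inline u32 skb_mac_header_len(const struct sk_buff *skb)
      {
              return skb->network_header - skb->mac_header;
      }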
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix to bpf_setsockops · a5192c52
      Authored by Lawrence Brakmo
      Fixed a build error caused by a misplaced "#ifdef CONFIG_INET"
      (moved one statement up).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 02 Jul 2017 (5 commits)
    • bpf: Adds support for setting sndcwnd clamp · 13bf9641
      Authored by Lawrence Brakmo
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP,
      which clamps the send congestion window.  It is useful to limit the
      sndcwnd when the hosts are close to each other (small RTT).
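
      A hypothetical fragment from a SOCK_OPS program; the clamp value of
      100 is illustrative:

      int clamp = 100;
      /* limit the send congestion window for small-RTT peers */
      bpf_setsockopt(skops, SOL_TCP, TCP_BPF_SNDCWND_CLAMP,
                     &clamp, sizeof(clamp));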
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Adds support for setting initial cwnd · fc747810
      Authored by Lawrence Brakmo
      Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets
      the initial congestion window.  This can be used when the hosts are
      far apart (large RTTs) and it is safe to start with a large initial
      cwnd.
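
      The analogous hypothetical fragment (the value of 40 is
      illustrative; the option is meant to be set early, before data has
      been sent on the connection):

      int iw = 40;
      bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw, sizeof(iw));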
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add support for changing congestion control · 91b5b21c
      Authored by Lawrence Brakmo
      Added support for changing a socket's congestion control from
      SOCK_OPS BPF programs through the setsockopt bpf helper function.
      It also adds a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is
      needed by congestion controls, like dctcp, that need to enable ECN
      in the SYN packets.
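
      A hypothetical fragment mirroring the dctcp use case:

      char cong[] = "dctcp";
      /* BPF_SOCK_OPS_NEEDS_ECN lets dctcp request ECN in SYN packets */
      bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION, cong, sizeof(cong));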
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add setsockopt helper function to bpf · 8c4b4c7e
      Authored by Lawrence Brakmo
      Added support for calling a subset of socket setsockopts from
      BPF_PROG_TYPE_SOCK_OPS programs.  The code was duplicated rather
      than changed to call the socket setsockopt function, because the
      changes required for the latter would have been larger.
      
      The ops supported are:
        SO_RCVBUF
        SO_SNDBUF
        SO_MAX_PACING_RATE
        SO_PRIORITY
        SO_RCVLOWAT
        SO_MARK
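
      A hypothetical fragment exercising two of the listed ops (the
      buffer size is illustrative):

      int bufsize = 150000;
      bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
      bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));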
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: BPF support for sock_ops · 40304b2a
      Authored by Lawrence Brakmo
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connection parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result, it has an "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type, following patches add the functionality to set various connection
      parameters.
      
      This patch defines the new BPF program type, BPF_PROG_TYPE_SOCK_OPS,
      which is used when loading a program of this type through the bpf
      syscall.
      
      Two new corresponding structs (one for the kernel, one for the
      user/BPF program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sock_ops struct are there in case a BPF
      program needs to return a value larger than an integer.
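
      To make the flow concrete, a minimal hypothetical program of this
      type (patterned after the samples; the section name, op, and reply
      value are illustrative):

      SEC("sockops")
      int set_base_rto(struct bpf_sock_ops *skops)
      {
              int rv = -1;    /* default: operation not supported */

              if (skops->op == BPF_SOCK_OPS_TIMEOUT_INIT)
                      rv = 10;        /* illustrative SYN RTO value */

              skops->reply = rv;
              return 1;
      }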
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 01 Jul 2017 (9 commits)
  7. 30 Jun 2017 (3 commits)
    • ethtool: don't open-code memdup_user() · 30e7e3ec
      Authored by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() · e44699d2
      Authored by Michal Kubeček
      Recently I started seeing warnings about pages with refcount -1. The
      problem was traced to packets being reused after their head was merged into
      a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
      commit c21b48cc ("net: adjust skb->truesize in ___pskb_trim()") and
      I have never seen it on a kernel with it reverted, I believe the real
      problem appeared earlier when the option to merge head frag in GRO was
      implemented.
      
      Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
      branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
      and head is merged (which in my case happens after the skb_condense()
      call added by the commit mentioned above), the skb is reused including the
      head that has been merged. As a result, we release the page reference
      twice and eventually end up with negative page refcount.
      
      To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
      the same way it's done in napi_skb_finish().
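
      A simplified sketch of the resulting GRO_MERGED_FREE handling in
      napi_frags_finish():

      case GRO_MERGED_FREE:
              if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
                      napi_skb_free_stolen_head(skb);
              else
                      napi_reuse_skb(napi, skb);
              break;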
      
      Fixes: d7e8883c ("net: make GRO aware of skb->head_frag")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: constify attribute_group structures. · 38ef00cc
      Authored by Arvind Yadav
      attribute_groups are not supposed to change at runtime. All functions
      working with attribute_groups provided by <linux/device.h> work with const
      attribute_group. So mark the non-const structs as const.
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
         9968	   3168	     16	  13152	   3360	net/core/net-sysfs.o
      
      File size after adding 'const':
         text	   data	    bss	    dec	    hex	filename
        10160	   2976	     16	  13152	   3360	net/core/net-sysfs.o
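
      The change pattern is mechanical; a representative hunk (for
      illustration, not the full diff):

      -static struct attribute_group netstat_group = {
      +static const struct attribute_group netstat_group = {
               .name  = "statistics",
               .attrs = netstat_attrs,
       };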
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 28 Jun 2017 (1 commit)
  9. 27 Jun 2017 (5 commits)
  10. 25 Jun 2017 (1 commit)
    • net: store port/representator id in metadata_dst · 3fcece12
      Authored by Jakub Kicinski
      Switches and modern SR-IOV enabled NICs may multiplex traffic from
      port representators and control messages over a single set of
      hardware queues.  Control messages and muxed traffic may need
      ordered delivery.

      Those requirements make it hard to comfortably use the TC
      infrastructure today unless we have a way of attaching metadata to
      skbs at the upper device.  Because a single set of queues is used
      for many netdevs, reliably stopping the TC/sched queues of all of
      them is impossible, and the lower device has to resort to returning
      NETDEV_TX_BUSY, usually taking extra locks on the fastpath.
      
      This patch attempts to enable port/representative devs to attach metadata
      to skbs which carry port id.  This way representatives can be queueless and
      all queuing can be performed at the lower netdev in the usual way.
      
      Traffic arriving on the port/representative interfaces will have
      metadata attached and will subsequently be queued to the lower
      device for transmission.  The lower device should recognize the
      metadata and translate it to a HW-specific format, which is most
      likely either a special header inserted before the network headers
      or descriptor/metadata fields.
      
      Metadata is associated with the lower device by storing the netdev
      pointer along with the port id, so that if TC decides to redirect
      or mirror, the new netdev will not try to interpret it.
      
      This is mostly for SR-IOV devices since switches don't have lower netdevs
      today.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 24 Jun 2017 (4 commits)
  12. 21 Jun 2017 (3 commits)
    • net: introduce SO_PEERGROUPS getsockopt · 28b5ba2a
      Authored by David Herrmann
      This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
      retrieve the auxiliary groups of the remote peer. It is designed to
      naturally extend SO_PEERCRED. That is, the underlying data is from the
      same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
      is, if the provided buffer is too small, ERANGE is returned and @optlen
      is updated. Otherwise, the information is copied, @optlen is set to the
      actual size, and 0 is returned.
      
      While SO_PEERCRED (and thus `struct ucred') already returns the primary
      group, it lacks the auxiliary group vector. However, nearly all access
      controls (including kernel side VFS and SYSVIPC, but also user-space
      polkit, DBus, ...) consider the entire set of groups, rather than just
      the primary group. But this is currently not possible with pure
      SO_PEERCRED. Instead, user-space has to work around this and query the
      system database for the auxiliary groups of a UID retrieved via
      SO_PEERCRED.
      
      Unfortunately, there is no race-free way to query the auxiliary groups
      of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
      solution is to use getgrouplist(3p), which itself falls back to NSS and
      whatever is configured in nsswitch.conf(3). This effectively checks
      which groups we *would* assign to the user if it logged in *now*. On
      normal systems it is as easy as reading /etc/group, but with NSS it
      can resort to querying network databases (e.g., LDAP), using IPC or
      network communication.
      
      Long story short: Whenever we want to use auxiliary groups for access
      checks on IPC, we need further IPC to talk to the user/group databases,
      rather than just relying on SO_PEERCRED and the incoming socket. This
      is unfortunate, and might even result in dead-locks if the database
      query uses the same IPC as the original request.
      
      So far, those recursions / dead-locks have been avoided by using
      primitive IPC for all crucial NSS modules. However, we want to avoid
      re-inventing the wheel for each NSS module that might be involved in
      user/group queries. Hence, we would preferably make DBus (and other IPC
      that supports access-management based on groups) work without resorting
      to the user/group database. This new SO_PEERGROUPS getsockopt would
      allow us to make dbus-daemon work without ever calling into NSS.
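
      A hypothetical userspace caller following the described ERANGE
      contract (assumes <sys/socket.h>, <errno.h>, <stdlib.h>; the
      variable names are illustrative):

      gid_t *groups = NULL;
      socklen_t len = 0;
      int n_groups = -1;

      /* probe with a zero-sized buffer: expect ERANGE with len updated */
      if (getsockopt(fd, SOL_SOCKET, SO_PEERGROUPS, NULL, &len) < 0 &&
          errno == ERANGE) {
              groups = malloc(len);
              if (groups &&
                  getsockopt(fd, SOL_SOCKET, SO_PEERGROUPS, groups, &len) == 0)
                      n_groups = len / sizeof(gid_t);
      }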
      
      Cc: Michal Sekletar <msekleta@redhat.com>
      Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
      Reviewed-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: add restricted rtnl groups for ipv4 and ipv6 mroute · 5f729eaa
      Authored by Julien Gomes
      Add RTNLGRP_{IPV4,IPV6}_MROUTE_R as two new restricted groups for the
      NETLINK_ROUTE family.
      Binding to these groups specifically requires CAP_NET_ADMIN to allow
      multicast of sensitive messages (e.g. mroute cache reports).
      Suggested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: Julien Gomes <julien@arista.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: add IFLA_GROUP to ifla_policy · db833d40
      Authored by Serhey Popovych
      Support for network interface groups was added a while ago, but
      until now there was no IFLA_GROUP attribute description in the
      policy or in the netlink message size calculations.
      
      Add IFLA_GROUP attribute to the policy.
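
      The described addition amounts to one policy entry plus the size
      accounting (a sketch, not the verbatim diff):

      /* in ifla_policy[] */
      [IFLA_GROUP]    = { .type = NLA_U32 },

      /* and in if_nlmsg_size() */
      + nla_total_size(4) /* IFLA_GROUP */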
      
      Fixes: cbda10fa ("net_device: add support for network device groups")
      Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>