1. 27 4月, 2017 3 次提交
  2. 26 4月, 2017 1 次提交
    • D
      net: Generic XDP · b5cdae32
      David S. Miller 提交于
      This provides a generic SKB based non-optimized XDP path which is used
      if either the driver lacks a specific XDP implementation, or the user
      requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.
      
      It is arguable that perhaps I should have required something like
      this as part of the initial XDP feature merge.
      
      I believe this is critical for two reasons:
      
      1) Accessibility.  More people can play with XDP with less
         dependencies.  Yes I know we have XDP support in virtio_net, but
         that just creates another depedency for learning how to use this
         facility.
      
         I wrote this to make life easier for the XDP newbies.
      
      2) As a model for what the expected semantics are.  If there is a pure
         generic core implementation, it serves as a semantic example for
         driver folks adding XDP support.
      
      One thing I have not tried to address here is the issue of
      XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
      incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
      whatever even if the XDP program doesn't try to push headers at all.
      I think we really need the verifier to somehow propagate whether
      certain XDP helpers are used or not.
      
      v5:
       - Handle both negative and positive offset after running prog
       - Fix mac length in XDP_TX case (Alexei)
       - Use rcu_dereference_protected() in free_netdev (kbuild test robot)
      
      v4:
       - Fix MAC header adjustmnet before calling prog (David Ahern)
       - Disable LRO when generic XDP is installed (Michael Chan)
       - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
       - Do not perform generic XDP on reinjected packets (DaveM)
      
      v3:
       - Make sure XDP program sees packet at MAC header, push back MAC
         header if we do XDP_TX.  (Alexei)
       - Elide GRO when generic XDP is in use.  (Alexei)
       - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
         XDP even if the driver has an XDP implementation.  (Alexei)
       - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
         attribute.  (Daniel)
      
      v2:
       - Add some "fall through" comments in switch statements based
         upon feedback from Andrew Lunn
       - Use RCU for generic xdp_prog, thanks to Johannes Berg.
      Tested-by: NAndy Gospodarek <andy@greyhouse.net>
      Tested-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5cdae32
  3. 25 4月, 2017 21 次提交
  4. 23 4月, 2017 1 次提交
  5. 22 4月, 2017 8 次提交
    • D
      net: Remove NET_CORE_BUDGET_USECS from sysctl binary interface. · 1f4407e2
      David S. Miller 提交于
      We are not supposed to add new entries to this thing
      any more.
      
      Thanks to Eric Dumazet for noticing this.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f4407e2
    • M
      bonding: fix wq initialization for links created via netlink · ea8ffc08
      Mahesh Bandewar 提交于
      Earlier patch 4493b81b ("bonding: initialize work-queues during
      creation of bond") moved the work-queue initialization from bond_open()
      to bond_create(). However this caused the link those are created using
      netlink 'create bond option' (ip link add bondX type bond); create the
      new trunk without initializing work-queues. Prior to the above mentioned
      change, ndo_open was in both paths and things worked correctly. The
      consequence is visible in the report shared by Joe Stringer -
      
      I've noticed that this patch breaks bonding within namespaces if
      you're not careful to perform device cleanup correctly.
      
      Here's my repro script, you can run on any net-next with this patch
      and you'll start seeing some weird behaviour:
      
      ip netns add foo
      ip li add veth0 type veth peer name veth0+ netns foo
      ip li add veth1 type veth peer name veth1+ netns foo
      ip netns exec foo ip li add bond0 type bond
      ip netns exec foo ip li set dev veth0+ master bond0
      ip netns exec foo ip li set dev veth1+ master bond0
      ip netns exec foo ip addr add dev bond0 192.168.0.1/24
      ip netns exec foo ip li set dev bond0 up
      ip li del dev veth0
      ip li del dev veth1
      
      The second to last command segfaults, last command hangs. rtnl is now
      permanently locked. It's not a problem if you take bond0 down before
      deleting veths, or delete bond0 before deleting veths. If you delete
      either end of the veth pair as per above, either inside or outside the
      namespace, it hits this problem.
      
      Here's some kernel logs:
      [ 1221.801610] bond0: Enslaving veth0+ as an active interface with an up link
      [ 1224.449581] bond0: Enslaving veth1+ as an active interface with an up link
      [ 1281.193863] bond0: Releasing backup interface veth0+
      [ 1281.193866] bond0: the permanent HWaddr of veth0+ -
      16:bf:fb:e0:b8:43 - is still in use by bond0 - set the HWaddr of
      veth0+ to a different address to avoid conflicts
      [ 1281.193867] ------------[ cut here ]------------
      [ 1281.193873] WARNING: CPU: 0 PID: 2024 at kernel/workqueue.c:1511
      __queue_delayed_work+0x13f/0x150
      [ 1281.193873] Modules linked in: bonding veth openvswitch nf_nat_ipv6
      nf_nat_ipv4 nf_nat autofs4 nfsd auth_rpcgss nfs_acl binfmt_misc nfs
      lockd grace sunrpc fscache ppdev vmw_balloon coretemp psmouse
      serio_raw vmwgfx ttm drm_kms_helper vmw_vmci netconsole parport_pc
      configfs drm i2c_piix4 fb_sys_fops syscopyarea sysfillrect sysimgblt
      shpchp mac_hid nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4
      nf_defrag_ipv4 nf_conntrack libcrc32c lp parport hid_generic usbhid
      hid mptspi mptscsih e1000 mptbase ahci libahci
      [ 1281.193905] CPU: 0 PID: 2024 Comm: ip Tainted: G        W
      4.10.0-bisect-bond-v0.14 #37
      [ 1281.193906] Hardware name: VMware, Inc. VMware Virtual
      Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
      [ 1281.193906] Call Trace:
      [ 1281.193912]  dump_stack+0x63/0x89
      [ 1281.193915]  __warn+0xd1/0xf0
      [ 1281.193917]  warn_slowpath_null+0x1d/0x20
      [ 1281.193918]  __queue_delayed_work+0x13f/0x150
      [ 1281.193920]  queue_delayed_work_on+0x27/0x40
      [ 1281.193929]  bond_change_active_slave+0x25b/0x670 [bonding]
      [ 1281.193932]  ? synchronize_rcu_expedited+0x27/0x30
      [ 1281.193935]  __bond_release_one+0x489/0x510 [bonding]
      [ 1281.193939]  ? addrconf_notify+0x1b7/0xab0
      [ 1281.193942]  bond_netdev_event+0x2c5/0x2e0 [bonding]
      [ 1281.193944]  ? netconsole_netdev_event+0x124/0x190 [netconsole]
      [ 1281.193947]  notifier_call_chain+0x49/0x70
      [ 1281.193948]  raw_notifier_call_chain+0x16/0x20
      [ 1281.193950]  call_netdevice_notifiers_info+0x35/0x60
      [ 1281.193951]  rollback_registered_many+0x23b/0x3e0
      [ 1281.193953]  unregister_netdevice_many+0x24/0xd0
      [ 1281.193955]  rtnl_delete_link+0x3c/0x50
      [ 1281.193956]  rtnl_dellink+0x8d/0x1b0
      [ 1281.193960]  rtnetlink_rcv_msg+0x95/0x220
      [ 1281.193962]  ? __kmalloc_node_track_caller+0x35/0x280
      [ 1281.193964]  ? __netlink_lookup+0xf1/0x110
      [ 1281.193966]  ? rtnl_newlink+0x830/0x830
      [ 1281.193967]  netlink_rcv_skb+0xa7/0xc0
      [ 1281.193969]  rtnetlink_rcv+0x28/0x30
      [ 1281.193970]  netlink_unicast+0x15b/0x210
      [ 1281.193971]  netlink_sendmsg+0x319/0x390
      [ 1281.193974]  sock_sendmsg+0x38/0x50
      [ 1281.193975]  ___sys_sendmsg+0x25c/0x270
      [ 1281.193978]  ? mem_cgroup_commit_charge+0x76/0xf0
      [ 1281.193981]  ? page_add_new_anon_rmap+0x89/0xc0
      [ 1281.193984]  ? lru_cache_add_active_or_unevictable+0x35/0xb0
      [ 1281.193985]  ? __handle_mm_fault+0x4e9/0x1170
      [ 1281.193987]  __sys_sendmsg+0x45/0x80
      [ 1281.193989]  SyS_sendmsg+0x12/0x20
      [ 1281.193991]  do_syscall_64+0x6e/0x180
      [ 1281.193993]  entry_SYSCALL64_slow_path+0x25/0x25
      [ 1281.193995] RIP: 0033:0x7f6ec122f5a0
      [ 1281.193995] RSP: 002b:00007ffe69e89c48 EFLAGS: 00000246 ORIG_RAX:
      000000000000002e
      [ 1281.193997] RAX: ffffffffffffffda RBX: 00007ffe69e8dd60 RCX: 00007f6ec122f5a0
      [ 1281.193997] RDX: 0000000000000000 RSI: 00007ffe69e89c90 RDI: 0000000000000003
      [ 1281.193998] RBP: 00007ffe69e89c90 R08: 0000000000000000 R09: 0000000000000003
      [ 1281.193999] R10: 00007ffe69e89a10 R11: 0000000000000246 R12: 0000000058f14b9f
      [ 1281.193999] R13: 0000000000000000 R14: 00000000006473a0 R15: 00007ffe69e8e450
      [ 1281.194001] ---[ end trace 713a77486cbfbfa3 ]---
      
      Fixes: 4493b81b ("bonding: initialize work-queues during creation of bond")
      Reported-by: NJoe Stringer <joe@ovn.org>
      Tested-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Acked-by: NAndy Gospodarek <andy@greyhouse.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea8ffc08
    • W
      net_sched: move the empty tp check from ->destroy() to ->delete() · 763dbf63
      WANG Cong 提交于
      We could have a race condition where in ->classify() path we
      dereference tp->root and meanwhile a parallel ->destroy() makes it
      a NULL. Daniel cured this bug in commit d9363774
      ("net, sched: respect rcu grace period on cls destruction").
      
      This happens when ->destroy() is called for deleting a filter to
      check if we are the last one in tp, this tp is still linked and
      visible at that time. The root cause of this problem is the semantic
      of ->destroy(), it does two things (for non-force case):
      
      1) check if tp is empty
      2) if tp is empty we could really destroy it
      
      and its caller, if cares, needs to check its return value to see if it
      is really destroyed. Therefore we can't unlink tp unless we know it is
      empty.
      
      As suggested by Daniel, we could actually move the test logic to ->delete()
      so that we can safely unlink tp after ->delete() tells us the last one is
      just deleted and before ->destroy().
      
      Fixes: 1e052be6 ("net_sched: destroy proto tp when all filters are gone")
      Cc: Roi Dayan <roid@mellanox.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      763dbf63
    • D
      net: ipv6: RTF_PCPU should not be settable from userspace · 557c44be
      David Ahern 提交于
      Andrey reported a fault in the IPv6 route code:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff880069809600 task.stack: ffff880062dc8000
      RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
      RSP: 0018:ffff880062dced30 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: ffff8800670561c0 RCX: 0000000000000006
      RDX: 0000000000000003 RSI: ffff880062dcfb28 RDI: 0000000000000018
      RBP: ffff880062dced68 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff880062dcfb28 R14: dffffc0000000000 R15: 0000000000000000
      FS:  00007feebe37e7c0(0000) GS:ffff88006cb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000205a0fe4 CR3: 000000006b5c9000 CR4: 00000000000006e0
      Call Trace:
       ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
       ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
      ...
      
      Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
      set. Flags passed to the kernel are blindly copied to the allocated
      rt6_info by ip6_route_info_create making a newly inserted route appear
      as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
      and expects rt->dst.from to be set - which it is not since it is not
      really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
      generates the fault.
      
      Fix by checking for the flag and failing with EINVAL.
      
      Fixes: d52d3997 ("ipv6: Create percpu rt6_info")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      557c44be
    • D
      bpf: add napi_id read access to __sk_buff · b1d9fc41
      Daniel Borkmann 提交于
      Add napi_id access to __sk_buff for socket filter program types, tc
      program types and other bpf_convert_ctx_access() users. Having access
      to skb->napi_id is useful for per RX queue listener siloing, f.e.
      in combination with SO_ATTACH_REUSEPORT_EBPF and when busy polling is
      used, meaning SO_REUSEPORT enabled listeners can then select the
      corresponding socket at SYN time already [1]. The skb is marked via
      skb_mark_napi_id() early in the receive path (e.g., napi_gro_receive()).
      
      Currently, sockets can only use SO_INCOMING_NAPI_ID from 6d433902
      ("net: Introduce SO_INCOMING_NAPI_ID") as a socket option to look up
      the NAPI ID associated with the queue for steering, which requires a
      prior sk_mark_napi_id() after the socket was looked up.
      
      Semantics for the __sk_buff napi_id access are similar, meaning if
      skb->napi_id is < MIN_NAPI_ID (e.g. outgoing packets using sender_cpu),
      then an invalid napi_id of 0 is returned to the program, otherwise a
      valid non-zero napi_id.
      
        [1] http://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdfSuggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1d9fc41
    • M
      Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning · 7acf8a1e
      Matthew Whitehead 提交于
      Constants used for tuning are generally a bad idea, especially as hardware
      changes over time. Replace the constant 2 jiffies with sysctl variable
      netdev_budget_usecs to enable sysadmins to tune the softirq processing.
      Also document the variable.
      
      For example, a very fast machine might tune this to 1000 microseconds,
      while my regression testing 486DX-25 needs it to be 4000 microseconds on
      a nearly idle network to prevent time_squeeze from being incremented.
      
      Version 2: changed jiffies to microseconds for predictable units.
      Signed-off-by: NMatthew Whitehead <tedheadster@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7acf8a1e
    • C
      ip_tunnel: Allow policy-based routing through tunnels · 9830ad4c
      Craig Gallek 提交于
      This feature allows the administrator to set an fwmark for
      packets traversing a tunnel.  This allows the use of independent
      routing tables for tunneled packets without the use of iptables.
      
      There is no concept of per-packet routing decisions through IPv4
      tunnels, so this implementation does not need to work with
      per-packet route lookups as the v6 implementation may
      (with IP6_TNL_F_USE_ORIG_FWMARK).
      
      Further, since the v4 tunnel ioctls share datastructures
      (which can not be trivially modified) with the kernel's internal
      tunnel configuration structures, the mark attribute must be stored
      in the tunnel structure itself and passed as a parameter when
      creating or changing tunnel attributes.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9830ad4c
    • C
      ip6_tunnel: Allow policy-based routing through tunnels · 0a473b82
      Craig Gallek 提交于
      This feature allows the administrator to set an fwmark for
      packets traversing a tunnel.  This allows the use of independent
      routing tables for tunneled packets without the use of iptables.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a473b82
  6. 21 4月, 2017 2 次提交
  7. 19 4月, 2017 3 次提交
    • F
      rhashtable: remove insecure_elasticity · 5f8ddeab
      Florian Westphal 提交于
      commit 83e7e4ce ("mac80211: Use rhltable instead of rhashtable")
      removed the last user that made use of 'insecure_elasticity' parameter,
      i.e. the default of 16 is used everywhere.
      
      Replace it with a constant.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f8ddeab
    • X
      sctp: process duplicated strreset out and addstrm out requests correctly · e4dc99c7
      Xin Long 提交于
      Now sctp stream reconf will process a request again even if it's seqno is
      less than asoc->strreset_inseq.
      
      If one request has been done successfully and some data chunks have been
      accepted and then a duplicated strreset out request comes, the streamin's
      ssn will be cleared. It will cause that stream will never receive chunks
      any more because of unsynchronized ssn. It allows a replay attack.
      
      A similar issue also exists when processing addstrm out requests. It will
      cause more extra streams being added.
      
      This patch is to fix it by saving the last 2 results into asoc. When a
      duplicated strreset out or addstrm out request is received, reply it with
      bad seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with the
      result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.
      
      Note that it saves last 2 results instead of only last 1 result, because
      two requests can be sent together in one chunk.
      
      And note that when receiving a duplicated request, the receiver side will
      still reply it even if the peer has received the response. It's safe, As
      the response will be dropped by the peer.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4dc99c7
    • H
      mmc: sdio: fix alignment issue in struct sdio_func · 5ef1ecf0
      Heiner Kallweit 提交于
      Certain 64-bit systems (e.g. Amlogic Meson GX) require buffers to be
      used for DMA to be 8-byte-aligned. struct sdio_func has an embedded
      small DMA buffer not meeting this requirement.
      When testing switching to descriptor chain mode in meson-gx driver
      SDIO is broken therefore. Fix this by allocating the small DMA buffer
      separately as kmalloc ensures that the returned memory area is
      properly aligned for every basic data type.
      Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Tested-by: NHelmut Klein <hgkr.klein@gmail.com>
      Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>
      5ef1ecf0
  8. 18 4月, 2017 1 次提交