1. 26 2月, 2016 1 次提交
    • L
      net: fix bridge multicast packet checksum validation · 9b368814
      Linus Lüssing 提交于
      We need to update the skb->csum after pulling the skb, otherwise
      an unnecessary checksum (re)computation can ocure for IGMP/MLD packets
      in the bridge code. Additionally this fixes the following splats for
      network devices / bridge ports with support for and enabled RX checksum
      offloading:
      
      [...]
      [   43.986968] eth0: hw csum failure
      [   43.990344] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0 #2
      [   43.996193] Hardware name: BCM2709
      [   43.999647] [<800204e0>] (unwind_backtrace) from [<8001cf14>] (show_stack+0x10/0x14)
      [   44.007432] [<8001cf14>] (show_stack) from [<801ab614>] (dump_stack+0x80/0x90)
      [   44.014695] [<801ab614>] (dump_stack) from [<802e4548>] (__skb_checksum_complete+0x6c/0xac)
      [   44.023090] [<802e4548>] (__skb_checksum_complete) from [<803a055c>] (ipv6_mc_validate_checksum+0x104/0x178)
      [   44.032959] [<803a055c>] (ipv6_mc_validate_checksum) from [<802e111c>] (skb_checksum_trimmed+0x130/0x188)
      [   44.042565] [<802e111c>] (skb_checksum_trimmed) from [<803a06e8>] (ipv6_mc_check_mld+0x118/0x338)
      [   44.051501] [<803a06e8>] (ipv6_mc_check_mld) from [<803b2c98>] (br_multicast_rcv+0x5dc/0xd00)
      [   44.060077] [<803b2c98>] (br_multicast_rcv) from [<803aa510>] (br_handle_frame_finish+0xac/0x51c)
      [...]
      
      Fixes: 9afd85c9 ("net: Export IGMP/MLD message validation code")
      Reported-by: NÁlvaro Fernández Rojas <noltari@gmail.com>
      Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b368814
  2. 25 2月, 2016 1 次提交
    • D
      bpf: fix csum setting for bpf_set_tunnel_key · 2da897e5
      Daniel Borkmann 提交于
      The fix in 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly
      controlled.") changed behavior for bpf_set_tunnel_key() when in use with
      IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't.
      As a result, the stack dropped ingress vxlan IPv6 packets, that have been
      sent via eBPF through collect meta data mode due to checksum now being zero.
      
      Since after LCO, we enable IPv4 checksum by default, so make that analogous
      and only provide a flag BPF_F_ZERO_CSUM_TX for the user to turn it off in
      IPv4 case.
      
      Fixes: 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.")
      Fixes: c6c33454 ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2da897e5
  3. 20 2月, 2016 1 次提交
    • N
      net: make netdev_for_each_lower_dev safe for device removal · cfdd28be
      Nikolay Aleksandrov 提交于
      When I used netdev_for_each_lower_dev in commit bad53162 ("vrf:
      remove slave queue and private slave struct") I thought that it acts
      like netdev_for_each_lower_private and can be used to remove the current
      device from the list while walking, but unfortunately it acts more like
      netdev_for_each_lower_private_rcu and doesn't allow it. The difference
      is where the "iter" points to, right now it points to the current element
      and that makes it impossible to remove it. Change the logic to be
      similar to netdev_for_each_lower_private and make it point to the "next"
      element so we can safely delete the current one. VRF is the only such
      user right now, there's no change for the read-only users.
      
      Here's what can happen now:
      [98423.249858] general protection fault: 0000 [#1] SMP
      [98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng
      sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw
      gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon
      parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button
      9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg
      virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd
      ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy
      virtio_ring virtio scsi_mod [last unloaded: bridge]
      [98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G           O
      4.5.0-rc2+ #81
      [98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.8.1-20150318_183358- 04/01/2014
      [98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti:
      ffff88003428c000
      [98423.256123] RIP: 0010:[<ffffffff81514f3e>]  [<ffffffff81514f3e>]
      netdev_lower_get_next+0x1e/0x30
      [98423.256534] RSP: 0018:ffff88003428f940  EFLAGS: 00010207
      [98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX:
      0000000000000000
      [98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI:
      ffff880054ff90c0
      [98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09:
      0000000000000000
      [98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12:
      ffff88003428f9e0
      [98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15:
      0000000000000001
      [98423.258055] FS:  00007f3d76881700(0000) GS:ffff88005d000000(0000)
      knlGS:0000000000000000
      [98423.258418] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4:
      00000000000406e0
      [98423.258902] Stack:
      [98423.259075]  ffff88003428f960 ffffffffa0442636 0002000100000004
      ffff880054ff9000
      [98423.259647]  ffff88003428f9b0 ffffffff81518205 ffff880054ff9000
      ffff88003428f978
      [98423.260208]  ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0
      ffff880035b35f00
      [98423.260739] Call Trace:
      [98423.260920]  [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf]
      [98423.261156]  [<ffffffff81518205>]
      rollback_registered_many+0x205/0x390
      [98423.261401]  [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70
      [98423.261641]  [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50
      [98423.271557]  [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0
      [98423.271800]  [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90
      [98423.272049]  [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200
      [98423.272279]  [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10
      [98423.272513]  [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40
      [98423.272755]  [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40
      [98423.272983]  [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0
      [98423.273209]  [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40
      [98423.273476]  [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0
      [98423.273710]  [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610
      [98423.273947]  [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70
      [98423.274175]  [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0
      [98423.274416]  [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140
      [98423.274658]  [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210
      [98423.274894]  [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210
      [98423.275130]  [<ffffffff81269611>] ? __fget_light+0x91/0xb0
      [98423.275365]  [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80
      [98423.275595]  [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20
      [98423.275827]  [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66
      90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48
      89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66
      [98423.279639] RIP  [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30
      [98423.279920]  RSP <ffff88003428f940>
      
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Vlad Yasevich <vyasevic@redhat.com>
      Fixes: bad53162 ("vrf: remove slave queue and private slave struct")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfdd28be
  4. 19 2月, 2016 1 次提交
    • P
      IFF_NO_QUEUE: Fix for drivers not calling ether_setup() · a813104d
      Phil Sutter 提交于
      My implementation around IFF_NO_QUEUE driver flag assumed that leaving
      tx_queue_len untouched (specifically: not setting it to zero) by drivers
      would make it possible to assign a regular qdisc to them without having
      to worry about setting tx_queue_len to a useful value. This was only
      partially true: I overlooked that some drivers don't call ether_setup()
      and therefore not initialize tx_queue_len to the default value of 1000.
      Consequently, removing the workarounds in place for that case in qdisc
      implementations which cared about it (namely, pfifo, bfifo, gred, htb,
      plug and sfb) leads to problems with these specific interface types and
      qdiscs.
      
      Luckily, there's already a sanitization point for drivers setting
      tx_queue_len to zero, which can be reused to assign the fallback value
      most qdisc implementations used, which is 1.
      
      Fixes: 348e3435 ("net: sched: drop all special handling of tx_queue_len == 0")
      Tested-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a813104d
  5. 17 2月, 2016 1 次提交
  6. 09 2月, 2016 2 次提交
  7. 08 2月, 2016 1 次提交
  8. 21 1月, 2016 1 次提交
  9. 20 1月, 2016 1 次提交
  10. 16 1月, 2016 1 次提交
  11. 15 1月, 2016 4 次提交
  12. 13 1月, 2016 2 次提交
  13. 12 1月, 2016 3 次提交
  14. 11 1月, 2016 3 次提交
    • A
      net/rtnetlink: remove unused sz_idx variable · 617cfc75
      Alexander Kuleshov 提交于
      The sz_idx variable is defined in the rtnetlink_rcv_msg(), but
      not used anywhere. Let's remove it.
      Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      617cfc75
    • D
      net, sched: add clsact qdisc · 1f211a1b
      Daniel Borkmann 提交于
      This work adds a generalization of the ingress qdisc as a qdisc holding
      only classifiers. The clsact qdisc works on ingress, but also on egress.
      In both cases, it's execution happens without taking the qdisc lock, and
      the main difference for the egress part compared to prior version of [1]
      is that this can be applied with _any_ underlying real egress qdisc (also
      classless ones).
      
      Besides solving the use-case of [1], that is, allowing for more programmability
      on assigning skb->priority for the mqprio case that is supported by most
      popular 10G+ NICs, it also opens up a lot more flexibility for other tc
      applications. The main work on classification can already be done at clsact
      egress time if the use-case allows and state stored for later retrieval
      f.e. again in skb->priority with major/minors (which is checked by most
      classful qdiscs before consulting tc_classify()) and/or in other skb fields
      like skb->tc_index for some light-weight post-processing to get to the
      eventual classid in case of a classful qdisc. Another use case is that
      the clsact egress part allows to have a central egress counterpart to
      the ingress classifiers, so that classifiers can easily share state (e.g.
      in cls_bpf via eBPF maps) for ingress and egress.
      
      Currently, default setups like mq + pfifo_fast would require for this to
      use, for example, prio qdisc instead (to get a tc_classify() run) and to
      duplicate the egress classifier for each queue. With clsact, it allows
      for leaving the setup as is, it can additionally assign skb->priority to
      put the skb in one of pfifo_fast's bands and it can share state with maps.
      Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
      w/o the need to perform a skb_dst_force() to hold on to it any longer. In
      lwt case, we can also use this facility to setup dst metadata via cls_bpf
      (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
      that (case of IFF_NO_QUEUE devices, for example).
      
      The realization can be done without any changes to the scheduler core
      framework. All it takes is that we have two a-priori defined minors/child
      classes, where we can mux between ingress and egress classifier list
      (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
      dev->_tx to avoid extra cacheline miss for moderate loads). The egress
      part is a bit similar modelled to handle_ing() and patched to a noop in
      case the functionality is not used. Both handlers are now called
      sch_handle_ingress() and sch_handle_egress(), code sharing among the two
      doesn't seem practical as there are various minor differences in both
      paths, so that making them conditional in a single handler would rather
      slow things down.
      
      Full compatibility to ingress qdisc is provided as well. Since both
      piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
      per netdevice, and thus ingress qdisc specific behaviour can be retained
      for user space. This means, either a user does 'tc qdisc add dev foo ingress'
      and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
      alternative, where both, ingress and egress classifier can be configured
      as in the below example. ingress qdisc supports attaching classifier to any
      minor number whereas clsact has two fixed minors for muxing between the
      lists, therefore to not break user space setups, they are better done as
      two separate qdiscs.
      
      I decided to extend the sch_ingress module with clsact functionality so
      that commonly used code can be reused, the module is being aliased with
      sch_clsact so that it can be auto-loaded properly. Alternative would have been
      to add a flag when initializing ingress to alter its behaviour plus aliasing
      to a different name (as it's more than just ingress). However, the first would
      end up, based on the flag, choosing the new/old behaviour by calling different
      function implementations to handle each anyway, the latter would require to
      register ingress qdisc once again under different alias. So, this really begs
      to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
      by its own that share callbacks used by both.
      
      Example, adding qdisc:
      
         # tc qdisc add dev foo clsact
         # tc qdisc show dev foo
         qdisc mq 0: root
         qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc clsact ffff: parent ffff:fff1
      
      Adding filters (deleting, etc works analogous by specifying ingress/egress):
      
         # tc filter add dev foo ingress bpf da obj bar.o sec ingress
         # tc filter add dev foo egress  bpf da obj bar.o sec egress
         # tc filter show dev foo ingress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
         # tc filter show dev foo egress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
      
      A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
      show an empty list for clsact. Either using the parent names (ingress/egress)
      or specifying the full major/minor will then show the related filter lists.
      
      Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
      
        [1] http://patchwork.ozlabs.org/patch/512949/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f211a1b
    • D
      bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      Daniel Borkmann 提交于
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that covers also Ethernet header, f.e. since 2c26d34b
      ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
      also do skb_postpull_rcsum() after pulling Ethernet header off via
      eth_type_trans().
      
      When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
      I can trigger the following csum issue with an IPv6 setup:
      
        [  505.144065] dummy1: hw csum failure
        [...]
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
        [...]
      
      What happens is that on ingress, we push Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating csum.
      The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
      
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8ffad69
  15. 07 1月, 2016 1 次提交
  16. 06 1月, 2016 1 次提交
  17. 05 1月, 2016 2 次提交
    • C
      soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF · 538950a1
      Craig Gallek 提交于
      Expose socket options for setting a classic or extended BPF program
      for use when selecting sockets in an SO_REUSEPORT group.  These options
      can be used on the first socket to belong to a group before bind or
      on any socket in the group after bind.
      
      This change includes refactoring of the existing sk_filter code to
      allow reuse of the existing BPF filter validation checks.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      538950a1
    • C
      soreuseport: define reuseport groups · ef456144
      Craig Gallek 提交于
      struct sock_reuseport is an optional shared structure referenced by each
      socket belonging to a reuseport group.  When a socket is bound to an
      address/port not yet in use and the reuseport flag has been set, the
      structure will be allocated and attached to the newly bound socket.
      When subsequent calls to bind are made for the same address/port, the
      shared structure will be updated to include the new socket and the
      newly bound socket will reference the group structure.
      
      Usually, when an incoming packet was destined for a reuseport group,
      all sockets in the same group needed to be considered before a
      dispatching decision was made.  With this structure, an appropriate
      socket can be found after looking up just one socket in the group.
      
      This shared structure will also allow for more complicated decisions to
      be made when selecting a socket (eg a BPF filter).
      
      This work is based off a similar implementation written by
      Ying Cai <ycai@google.com> for implementing policy-based reuseport
      selection.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef456144
  18. 31 12月, 2015 1 次提交
  19. 23 12月, 2015 1 次提交
  20. 19 12月, 2015 3 次提交
    • D
      bpf: fix misleading comment in bpf_convert_filter · 23bf8807
      Daniel Borkmann 提交于
      Comment says "User BPF's register A is mapped to our BPF register 6",
      which is actually wrong as the mapping is on register 0. This can
      already be inferred from the code itself. So just remove it before
      someone makes assumptions based on that. Only code tells truth. ;)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23bf8807
    • D
      bpf: move clearing of A/X into classic to eBPF migration prologue · 8b614aeb
      Daniel Borkmann 提交于
      Back in the days where eBPF (or back then "internal BPF" ;->) was not
      exposed to user space, and only the classic BPF programs internally
      translated into eBPF programs, we missed the fact that for classic BPF
      A and X needed to be cleared. It was fixed back then via 83d5b7ef
      ("net: filter: initialize A and X registers"), and thus classic BPF
      specifics were added to the eBPF interpreter core to work around it.
      
      This added some confusion for JIT developers later on that take the
      eBPF interpreter code as an example for deriving their JIT. F.e. in
      f75298f5 ("s390/bpf: clear correct BPF accumulator register"), at
      least X could leak stack memory. Furthermore, since this is only needed
      for classic BPF translations and not for eBPF (verifier takes care
      that read access to regs cannot be done uninitialized), more complexity
      is added to JITs as they need to determine whether they deal with
      migrations or native eBPF where they can just omit clearing A/X in
      their prologue and thus reduce image size a bit, see f.e. cde66c2d
      ("s390/bpf: Only clear A and X for converted BPF programs"). In other
      cases (x86, arm64), A and X is being cleared in the prologue also for
      eBPF case, which is unnecessary.
      
      Lets move this into the BPF migration in bpf_convert_filter() where it
      actually belongs as long as the number of eBPF JITs are still few. It
      can thus be done generically; allowing us to remove the quirk from
      __bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
      while reducing code duplication on this matter in current(/future) eBPF
      JITs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Tested-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Zi Shen Lim <zlim.lnx@gmail.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Acked-by: NYang Shi <yang.shi@linaro.org>
      Acked-by: NZi Shen Lim <zlim.lnx@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b614aeb
    • D
      bpf: add bpf_skb_load_bytes helper · 05c74e5e
      Daniel Borkmann 提交于
      When hacking tc programs with eBPF, one of the issues that come up
      from time to time is to load addresses from headers. In eBPF as in
      classic BPF, we have BPF_LD | BPF_ABS | BPF_{B,H,W} instructions that
      extract a byte, half-word or word out of the skb data though helpers
      such as bpf_load_pointer() (interpreter case).
      
      F.e. extracting a whole IPv6 address could possibly look like ...
      
        union v6addr {
          struct {
            __u32 p1;
            __u32 p2;
            __u32 p3;
            __u32 p4;
          };
          __u8 addr[16];
        };
      
        [...]
      
        a.p1 = htonl(load_word(skb, off));
        a.p2 = htonl(load_word(skb, off +  4));
        a.p3 = htonl(load_word(skb, off +  8));
        a.p4 = htonl(load_word(skb, off + 12));
      
        [...]
      
        /* access to a.addr[...] */
      
      This work adds a complementary helper bpf_skb_load_bytes() (we also
      have bpf_skb_store_bytes()) as an alternative where the same call
      would look like from an eBPF program:
      
        ret = bpf_skb_load_bytes(skb, off, addr, sizeof(addr));
      
      Same verifier restrictions apply as in ffeedafb ("bpf: introduce
      current->pid, tgid, uid, gid, comm accessors") case, where stack memory
      access needs to be statically verified and thus guaranteed to be
      initialized in first use (otherwise verifier cannot tell whether a
      subsequent access to it is valid or not as it's runtime dependent).
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05c74e5e
  21. 18 12月, 2015 1 次提交
    • W
      net: check both type and procotol for tcp sockets · ac5cc977
      WANG Cong 提交于
      Dmitry reported the following out-of-bound access:
      
      Call Trace:
       [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40
      mm/kasan/report.c:294
       [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
       [<     inline     >] SYSC_setsockopt net/socket.c:1746
       [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
       [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a
      arch/x86/entry/entry_64.S:185
      
      This is because we mistake a raw socket as a tcp socket.
      We should check both sk->sk_type and sk->sk_protocol to ensure
      it is a tcp socket.
      
      Willem points out __skb_complete_tx_timestamp() needs to fix as well.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac5cc977
  22. 17 12月, 2015 1 次提交
    • H
      net: Pass ndm_state to route netlink FDB notifications. · b3379041
      Hubert Sokolowski 提交于
      Before this change applications monitoring FDB notifications
      were not able to determine whether a new FDB entry is permament
      or not:
      bridge fdb add f1:f2:f3:f4:f5:f8 dev sw0p1 temp self
      bridge fdb add f1:f2:f3:f4:f5:f9 dev sw0p1 self
      
      bridge monitor fdb
      
      f1:f2:f3:f4:f5:f8 dev sw0p1 self permanent
      f1:f2:f3:f4:f5:f9 dev sw0p1 self permanent
      
      With this change ndm_state from the original netlink message
      is passed to the new netlink message sent as notification.
      
      bridge fdb add f1:f2:f3:f4:f5:f6 dev sw0p1 self
      bridge fdb add f1:f2:f3:f4:f5:f7 dev sw0p1 temp self
      
      bridge monitor fdb
      f1:f2:f3:f4:f5:f6 dev sw0p1 self permanent
      f1:f2:f3:f4:f5:f7 dev sw0p1 self static
      Signed-off-by: NHubert Sokolowski <hubert.sokolowski@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3379041
  23. 16 12月, 2015 6 次提交
    • L
      net: diag: Add the ability to destroy a socket. · 64be0aed
      Lorenzo Colitti 提交于
      This patch adds a SOCK_DESTROY operation, a destroy function
      pointer to sock_diag_handler, and a diag_destroy function
      pointer.  It does not include any implementation code.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64be0aed
    • T
      net: Add driver helper functions to determine checksum offloadability · 6ae23ad3
      Tom Herbert 提交于
      Add skb_csum_offload_chk driver helper function to determine if a
      device with limited checksum offload capabilities is able to offload the
      checksum for a given packet.
      
      This patch includes:
        - The skb_csum_offload_chk function. Returns true if checksum is
          offloadable, else false. Optionally, in the case that the checksum
          is not offloable, the function can call skb_checksum_help to resolve
          the checksum. skb_csum_offload_chk also returns whether the checksum
          refers to an encapsulated checksum.
        - Definition of skb_csum_offl_spec structure that caller uses to
          indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only,
          whether encapsulated checksums can be offloaded, whether checksum with
          IPv6 extension headers can be offloaded).
        - Ancilary functions called skb_csum_offload_chk_help,
          skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ae23ad3
    • T
      net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM · c8cd0989
      Tom Herbert 提交于
      These netif flags are unnecessary convolutions. It is more
      straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
      and NETIF_F_IPV6_CSUM directly.
      
      This patch also:
          - Cleans up can_checksum_protocol
          - Simplifies netdev_intersect_features
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8cd0989
    • T
      net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK · a188222b
      Tom Herbert 提交于
      The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
      set of features for offloading all checksums. This is a mask of the
      checksum offload related features bits. It is incorrect to set both
      NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
      features of a device.
      
      This patch:
        - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
          NETIF_F_ALL_CSUM is being used as a mask).
        - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
          use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a188222b
    • T
      sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC · 53692b1d
      Tom Herbert 提交于
      The SCTP checksum is really a CRC and is very different from the
      standards 1's complement checksum that serves as the checksum
      for IP protocols. This offload interface is also very different.
      Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC to highlight these
      differences. The term CSUM should be reserved in the stack to refer
      to the standard 1's complement IP checksum.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53692b1d
    • I
      switchdev: Pass original device to port netdev driver · 6ff64f6f
      Ido Schimmel 提交于
      switchdev drivers need to know the netdev on which the switchdev op was
      invoked. For example, the STP state of a VLAN interface configured on top
      of a port can change while being member in a bridge. In this case, the
      underlying driver should only change the STP state of that particular
      VLAN and not of all the VLANs configured on the port.
      
      However, current switchdev infrastructure only passes the port netdev down
      to the driver. Solve that by passing the original device down to the
      driver as part of the required switchdev object / attribute.
      
      This doesn't entail any change in current switchdev drivers. It simply
      enables those supporting stacked devices to know the originating device
      and act accordingly.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ff64f6f