1. 30 Aug, 2017 4 commits
    • E
      neigh: increase queue_len_bytes to match wmem_default · eaa72dc4
      Eric Dumazet committed
      Florian reported UDP xmit drops that could be root caused to the
      too small neigh limit.
      
      Current limit is 64 KB, meaning that even a single UDP socket would hit
      it, since its default sk_sndbuf comes from net.core.wmem_default
      (~212992 bytes on 64bit arches).
      
      Once ARP/ND resolution is in progress, we should allow a few more
      packets to be queued, at least for one producer.
      
      Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
      limit and either block in sendmsg() or return -EAGAIN.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
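A rough sketch of the arithmetic behind the neigh change above (eaa72dc4). The two byte limits are the ones quoted in the commit message; the per-packet memory charge `pkt_cost` is a hypothetical stand-in for `skb->truesize`, which varies in practice, and the function name is made up for illustration.

```python
# Values quoted in the commit message above.
OLD_NEIGH_LIMIT = 64 * 1024   # old arp_queue byte limit
WMEM_DEFAULT = 212992         # default sk_sndbuf on 64-bit arches


def queued_before_limit(neigh_limit, sndbuf, pkt_cost):
    """Return (packets queued, neigh_limit_hit_first) for one sender.

    pkt_cost is a hypothetical per-packet memory charge in bytes.
    """
    by_neigh = neigh_limit // pkt_cost
    by_socket = sndbuf // pkt_cost
    # Whichever limit is smaller is reached first.
    return min(by_neigh, by_socket), by_neigh < by_socket


if __name__ == "__main__":
    print(queued_before_limit(OLD_NEIGH_LIMIT, WMEM_DEFAULT, 2048))
    print(queued_before_limit(WMEM_DEFAULT, WMEM_DEFAULT, 2048))
```

With the old 64 KB limit, the neigh queue overflows well before a single socket exhausts its send buffer. Once the neigh limit matches wmem_default, the socket limit is reached first, so the sender blocks or gets -EAGAIN.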
    • D
      ipv6: Use rt6i_idev index for echo replies to a local address · 1b70d792
      David Ahern committed
      Tariq reported that local pings to a link-local address are failing:
      $ ifconfig ens8
      ens8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet 11.141.16.6  netmask 255.255.0.0  broadcast 11.141.255.255
              inet6 fe80::7efe:90ff:fecb:7502  prefixlen 64  scopeid 0x20<link>
              ether 7c:fe:90:cb:75:02  txqueuelen 1000  (Ethernet)
              RX packets 12  bytes 1164 (1.1 KiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 30  bytes 2484 (2.4 KiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      $  /bin/ping6 -c 3 fe80::7efe:90ff:fecb:7502%ens8
      PING fe80::7efe:90ff:fecb:7502%ens8(fe80::7efe:90ff:fecb:7502) 56 data bytes
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      net: add NSH header structures and helpers · 1f0b7744
      Yi Yang committed
      NSH (Network Service Header)[1] is a new protocol for service
      function chaining. It can be handled as an L3 protocol like
      IPv4 and IPv6; Eth + NSH + Inner packet and VxLAN-gpe + NSH +
      Inner packet are two typical use cases.
      
      This patch adds NSH header structures and helpers for NSH GSO
      support and Open vSwitch NSH support.
      
      [1] https://datatracker.ietf.org/doc/draft-ietf-sfc-nsh/
      
      [Jiri: added nsh_hdr() helper and renamed the header struct to "struct
      nshhdr" to match the usual pattern. Removed packet type defines, these are
      now shared with VXLAN-GPE.]
      Signed-off-by: Yi Yang <yi.y.yang@intel.com>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      vxlan: factor out VXLAN-GPE next protocol · fa20e0e3
      Jiri Benc committed
      The values are shared between VXLAN-GPE and NSH. Originally probably by
      coincidence but I notified both working groups about this last year and they
      seem to keep the values in sync since then.
      
      Hopefully they'll get a single IANA registry for the values, too. (I asked
      them for that.)
      
      Factor out the code to be shared by the NSH implementation.
      
      NSH and MPLS values are added in this patch, too. For MPLS, the drafts
      incorrectly assign only a single value, while we have two MPLS ethertypes.
      I raised the problem with both groups. For now, I assume the value is for
      unicast.
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
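The shared next-protocol mapping described in the vxlan commit above (fa20e0e3) can be pictured as a small two-way table. This is a sketch, not the kernel code: the function names are illustrative, the next-protocol numbers are the ones I understand the VXLAN-GPE and NSH drafts to assign, and MPLS is mapped to the unicast ethertype only, following the assumption stated in the commit message.

```python
# Ethertypes as defined in linux/if_ether.h.
ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD
ETH_P_TEB = 0x6558       # transparent Ethernet bridging (Ethernet payload)
ETH_P_NSH = 0x894F
ETH_P_MPLS_UC = 0x8847   # MPLS unicast

# Next-protocol values shared by VXLAN-GPE and NSH (assumed from the drafts).
NEXT_PROTO_TO_ETHERTYPE = {
    1: ETH_P_IP,
    2: ETH_P_IPV6,
    3: ETH_P_TEB,
    4: ETH_P_NSH,
    5: ETH_P_MPLS_UC,    # the drafts assign a single MPLS value
}
ETHERTYPE_TO_NEXT_PROTO = {v: k for k, v in NEXT_PROTO_TO_ETHERTYPE.items()}


def next_proto_to_ethertype(proto):
    """Return the ethertype for a next-protocol value, or 0 if unknown."""
    return NEXT_PROTO_TO_ETHERTYPE.get(proto, 0)


def ethertype_to_next_proto(ethertype):
    """Return the next-protocol value for an ethertype, or 0 if unknown."""
    return ETHERTYPE_TO_NEXT_PROTO.get(ethertype, 0)
```

Keeping both directions in one place is what lets a VXLAN-GPE and an NSH implementation share the table rather than each carrying its own copy.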
  2. 29 Aug, 2017 5 commits
    • D
      rxrpc: Allow failed client calls to be retried · c038a58c
      David Howells committed
      Allow a client call that failed on network error to be retried, provided
      that the Tx queue still holds DATA packet 1.  This allows an operation to
      be submitted to another server or another address for the same server
      without having to repackage and re-encrypt the data so far processed.
      
      Two new functions are provided:
      
       (1) rxrpc_kernel_check_call() - This is used to find out the completion
           state of a call to guess whether it can be retried and whether it
           should be retried.
      
       (2) rxrpc_kernel_retry_call() - Disconnect the call from its current
           connection, reset the state and submit it as a new client call to a
           new address.  The new address need not match the previous address.
      
      A call may be retried even if all the data hasn't been loaded into it yet;
      a partially constructed call will be retained at the same point it was at when
      an error condition was detected.  msg_data_left() can be used to find out
      how much data was packaged before the error occurred.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • D
      rxrpc: Add notification of end-of-Tx phase · e833251a
      David Howells committed
      Add a callback to rxrpc_kernel_send_data() so that a kernel service can get
      a notification that the AF_RXRPC call has transitioned out of the Tx phase and
      is now waiting for a reply or a final ACK.
      
      This is called from AF_RXRPC with the call state lock held so the
      notification is guaranteed to come before any reply is passed back.
      
      Further, modify the AFS filesystem to make use of this so that we don't have
      to change the afs_call state before sending the last bit of data.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • S
      net/ncsi: Configure VLAN tag filter · 21acf630
      Samuel Mendoza-Jonas committed
      Make use of the ndo_vlan_rx_{add,kill}_vid callbacks to have the NCSI
      stack process new VLAN tags and configure the channel VLAN filter
      appropriately.
      Several VLAN tags can be set and a "Set VLAN Filter" packet must be sent
      for each one, meaning the ncsi_dev_state_config_svf state must be
      repeated. An internal list of VLAN tags is maintained, and compared
      against the current channel's ncsi_channel_filter in order to keep track
      within the state. VLAN filters are removed in a similar manner, with the
      introduction of the ncsi_dev_state_config_clear_vids state. The maximum
      number of VLAN tag filters is determined by the "Get Capabilities"
      response from the channel.
      Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • G
      irda: move include/net/irda into staging subdirectory · 5bf916ee
      Greg Kroah-Hartman committed
      And finally, move the irda include files into
      drivers/staging/irda/include/net/irda.  Yes, it's a long path, but it
      makes it easy for us to just add a Makefile directory path addition and
      all of the net and drivers code "just works".
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      gre: add collect_md mode to ERSPAN tunnel · 1a66a836
      William Tu committed
      Similar to gre, vxlan, geneve, ipip tunnels, allow ERSPAN tunnels to
      operate in 'collect metadata' mode.  bpf_skb_[gs]et_tunnel_key() helpers
      can make use of it right away.  OVS can use it as well in the future.
      Signed-off-by: William Tu <u9012063@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 26 Aug, 2017 3 commits
    • W
      net_sched: kill u32_node pointer in Qdisc · 3cd904ec
      WANG Cong committed
      It is ugly to hide a u32-filter-specific pointer inside Qdisc,
      this breaks the TC layers:
      
      1. Qdisc is a generic representation, should not have any specific
         data of any type
      
      2. Qdisc layer is above filter layer, should only save filters in
         the list of struct tcf_proto.
      
      This pointer is used as the head of the chain of u32 hash tables,
      that is struct tc_u_hnode, because the u32 filter is very special:
      it allows creating multiple hash tables within one qdisc and
      across multiple u32 filters.
      
      Instead of using this ugly pointer, we can just save it in a global
      hash table key'ed by (dev ifindex, qdisc handle), therefore we can
      still treat it as a per qdisc basis data structure conceptually.
      
      Of course, because of network namespaces, this key is not unique
      at all, but that is fine: we already have a pointer to Qdisc in
      struct tc_u_common, so we can just compare the pointers on collision.
      
      And this only affects slow paths and has no impact on the fast path,
      thanks to the pointer ->tp_c.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
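The lookup scheme described in the u32 commit above (3cd904ec) can be sketched as a hash table keyed by (ifindex, handle), with an identity check on the Qdisc back-pointer to resolve keys that collide across network namespaces. The classes, table size, and function names below are hypothetical stand-ins, not the kernel's data structures.

```python
class Qdisc:
    def __init__(self, ifindex, handle):
        self.ifindex = ifindex
        self.handle = handle


class TcUCommon:
    def __init__(self, q):
        self.q = q  # back-pointer used to disambiguate colliding keys


TABLE_SIZE = 64  # arbitrary size chosen for this sketch
table = [[] for _ in range(TABLE_SIZE)]


def bucket_of(q):
    # The key is (dev ifindex, qdisc handle); it is not unique across netns.
    return hash((q.ifindex, q.handle)) % TABLE_SIZE


def find_common(q):
    for common in table[bucket_of(q)]:
        if common.q is q:  # compare pointers on collision
            return common
    return None


def get_common(q):
    common = find_common(q)
    if common is None:
        common = TcUCommon(q)
        table[bucket_of(q)].append(common)
    return common
```

Two qdiscs in different namespaces can share the same ifindex and handle; they land in the same bucket but still get separate per-qdisc state because the pointer comparison tells them apart.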
    • W
      net_sched: remove tc class reference counting · 143976ce
      WANG Cong committed
      For TC classes, their ->get() and ->put() are always paired, and the
      reference counting is completely useless, because:
      
      1) For class modification and dumping paths, we already hold RTNL lock,
         so all of these ->get(),->change(),->put() are atomic.
      
      2) For filter binding/unbinding, we use a different reference counter than
         this one, and they should have RTNL lock too.
      
      3) For ->qlen_notify(), it is special because it is called on ->enqueue()
         path, but we already hold qdisc tree lock there, and we hold this
         tree lock when grafting or deleting the class too, so it should not be gone
         or changed until we release the tree lock.
      
      Therefore, this patch removes ->get() and ->put(), but:
      
      1) Adds a new ->find() to find the pointer to a class by classid, no
         refcnt.
      
      2) Move the original class destroy upon the last refcnt into ->delete(),
         right after releasing tree lock. This is fine because the class is
         already removed from hash when holding the lock.
      
      For those who also use ->put() as ->unbind(), just rename them to reflect
      this change.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: sr: add support for ip4ip6 encapsulation · 32d99d0b
      David Lebrun committed
      This patch enables the SRv6 encapsulation mode to carry an IPv4 payload.
      All the infrastructure was already present, I just had to add a parameter
      to seg6_do_srh_encap() to specify the inner packet protocol, and perform
      some additional checks.
      
      Usage example:
      ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 25 Aug, 2017 9 commits
    • E
      strparser: initialize all callbacks · 3fd87127
      Eric Biggers committed
      commit bbb03029 ("strparser: Generalize strparser") added more
      function pointers to 'struct strp_callbacks'; however, kcm_attach() was
      not updated to initialize them.  This could cause the ->lock() and/or
      ->unlock() function pointers to be set to garbage values, causing a
      crash in strp_work().
      
      Fix the bug by moving the callback structs into static memory, so
      unspecified members are zeroed.  Also constify them while we're at it.
      
      This bug was found by syzkaller, which encountered the following splat:
      
          IP: 0x55
          PGD 3b1ca067
          P4D 3b1ca067
          PUD 3b12f067
          PMD 0
      
          Oops: 0010 [#1] SMP KASAN
          Dumping ftrace buffer:
             (ftrace buffer empty)
          Modules linked in:
          CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Workqueue: kstrp strp_work
          task: ffff88006bb0e480 task.stack: ffff88006bb10000
          RIP: 0010:0x55
          RSP: 0018:ffff88006bb17540 EFLAGS: 00010246
          RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000
          RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48
          RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000
          R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48
          R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000
          FS:  0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
           worker_thread+0x223/0x1860 kernel/workqueue.c:2233
           kthread+0x35e/0x430 kernel/kthread.c:231
           ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
          Code:  Bad RIP value.
          RIP: 0x55 RSP: ffff88006bb17540
          CR2: 0000000000000055
          ---[ end trace f0e4920047069cee ]---
      
      Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
      CONFIG_AF_KCM=y):
      
          #include <linux/bpf.h>
          #include <linux/kcm.h>
          #include <linux/types.h>
          #include <stdint.h>
          #include <sys/ioctl.h>
          #include <sys/socket.h>
          #include <sys/syscall.h>
          #include <unistd.h>
      
          static const struct bpf_insn bpf_insns[3] = {
              { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
              { .code = 0x95 }, /* BPF_EXIT_INSN() */
          };
      
          static const union bpf_attr bpf_attr = {
              .prog_type = 1,
              .insn_cnt = 2,
              .insns = (uintptr_t)&bpf_insns,
              .license = (uintptr_t)"",
          };
      
          int main(void)
          {
              int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
                                   &bpf_attr, sizeof(bpf_attr));
              int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
              int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);
      
              ioctl(kcm_fd, SIOCKCMATTACH,
                    &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
          }
      
      Fixes: bbb03029 ("strparser: Generalize strparser")
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      ipv6: Compute multipath hash for ICMP errors from offending packet · 23aebdac
      Jakub Sitnicki committed
      When forwarding or sending out an ICMPv6 error, look at the embedded
      packet that triggered the error and compute a flow hash over its
      headers.
      
      This lets us route the ICMP error together with the flow it belongs to
      when multipath (ECMP) routing is in use, which in turn makes Path MTU
      Discovery work in ECMP load-balanced or anycast setups (RFC 7690).
      
      Granted, end-hosts behind the ECMP router (aka servers) need to reflect
      the IPv6 Flow Label for PMTUD to work.
      
      The code is organized to be in parallel with ipv4 stack:
      
        ip_multipath_l3_keys -> ip6_multipath_l3_keys
        fib_multipath_hash   -> rt6_multipath_hash
      Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
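The idea in the ICMPv6 multipath commit above (23aebdac) can be sketched as follows: when a packet is an ICMP error, the hash is computed over the embedded offending packet rather than the error's own header. The dictionary packet layout, the CRC32-based hash, and all function names are hypothetical; sorting the two addresses to make the hash direction-independent is an assumption of this sketch, and it only helps if the server reflects the flow label, as the commit message notes.

```python
import zlib


def flow_hash(src, dst, flowlabel):
    # Sort the addresses so both directions of a flow hash alike
    # (an assumption of this sketch).
    low, high = sorted((src, dst))
    return zlib.crc32(f"{low}|{high}|{flowlabel}".encode())


def multipath_hash(pkt):
    # For an ICMP error, hash the packet that triggered it instead.
    inner = pkt.get("icmp_error_payload")
    keys = inner if inner is not None else pkt
    return flow_hash(keys["src"], keys["dst"], keys["flowlabel"])


def select_nexthop(pkt, nexthops):
    return nexthops[multipath_hash(pkt) % len(nexthops)]
```

With this rule, a Packet Too Big error sent by an intermediate router toward the server picks the same next hop as the client's own packets, provided the server's replies carried the reflected flow label.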
    • J
      net: Extend struct flowi6 with multipath hash · 29825717
      Jakub Sitnicki committed
      Allow for functions that fill out the IPv6 flow info to also pass a hash
      computed over the skb contents. The hash value will drive the multipath
      routing decisions.
      
      This is intended for special treatment of ICMPv6 errors, where we would
      like to make a routing decision based on the flow identifying the
      offending IPv6 datagram that triggered the error, rather than the flow
      of the ICMP error itself.
      Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      devlink: Fix devlink_dpipe_table_register() stub signature. · 790c6056
      David S. Miller committed
      One too many arguments compared to the non-stub version.
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Fixes: ffd3cdcc ("devlink: Add support for dynamic table size")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      ipv6: Add sysctl for per namespace flow label reflection · 22b6722b
      Jakub Sitnicki committed
      Reflecting IPv6 Flow Label at server nodes is useful in environments
      that employ multipath routing to load balance the requests. As the "IPv6
      Flow Label Reflection" standard draft [1] points out, ICMPv6 PTB error
      messages generated in response to downstream packets from the server
      can be routed by a load balancer back to the original server without
      looking at transport headers, if the server applies the flow label
      reflection. This enables the Path MTU Discovery past the ECMP router in
      load-balance or anycast environments where each server node is reachable
      by only one path.
      
      Introduce a sysctl to enable flow label reflection per net namespace for
      all newly created sockets. Previously the same could be achieved only per
      socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR
      socket option.
      
      [1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
      Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      devlink: Move dpipe entry clear function into devlink · 35807324
      Arkadi Sharshevsky committed
      The entry clear routine can be shared between the drivers, thus it is
      moved inside devlink.
      Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      devlink: Add support for dynamic table size · ffd3cdcc
      Arkadi Sharshevsky committed
      Up until now the dpipe table's size was static and known at registration
      time. The host table does not have a constant size and is resized
      dynamically. In order to support this behavior the size is changed
      to be obtained dynamically via an op.
      
      This patch also adjusts the current dpipe table for the new API.
      Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      devlink: Add IPv4 header for dpipe · 3fb886ec
      Arkadi Sharshevsky committed
      This will be used by the IPv4 host table which will be introduced in the
      following patches. This header is global and can be reused by many
      drivers.
      Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      devlink: Add Ethernet header for dpipe · 11770091
      Arkadi Sharshevsky committed
      This will be used by the IPv4 host table which will be introduced in the
      following patches. This header is global and can be reused by many
      drivers.
      Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 24 Aug, 2017 3 commits
  6. 23 Aug, 2017 2 commits
  7. 22 Aug, 2017 2 commits
  8. 21 Aug, 2017 1 commit
  9. 19 Aug, 2017 2 commits
    • E
      ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t · 9620fef2
      Eric Dumazet committed
      refcount_t type and corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This helps avoid accidental
      refcounter overflows that might lead to use-after-free
      situations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
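To illustrate why the refcount_t conversion above (9620fef2) guards against overflow, here is a simplified model, not the kernel implementation. It assumes a 32-bit counter that saturates at its maximum value and refuses to increment from zero; the class and function names are made up for the sketch, and the kernel emits a warning where this model raises an exception.

```python
UINT_MAX = 0xFFFFFFFF


def atomic_inc(value):
    """Plain 32-bit counter: silently wraps to 0 on overflow."""
    return (value + 1) & UINT_MAX


class Refcount:
    """Simplified saturating reference counter (model only)."""

    def __init__(self, value=1):
        self.value = value

    def inc(self):
        if self.value == 0:
            # The object may already have been freed.
            raise RuntimeError("increment on 0: possible use-after-free")
        if self.value != UINT_MAX:
            self.value += 1  # otherwise stay saturated

    def dec_and_test(self):
        if self.value == UINT_MAX:
            return False     # saturated: leak the object rather than free it
        if self.value == 0:
            raise RuntimeError("decrement on 0: underflow")
        self.value -= 1
        return self.value == 0
```

A wrapped atomic_t counter reaches zero while references are still held, so the object can be freed too early. The saturating counter instead pins the object in memory, trading a possible leak for the absence of a use-after-free.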
    • M
      datagram: When peeking datagrams with offset < 0 don't skip empty skbs · a0917e0b
      Matthew Dawson committed
      Due to commit e6afc8ac ("udp: remove
      headers from UDP packets before queueing"), when udp packets are being
      peeked the requested extra offset is always 0 as there is no need to skip
      the udp header.  However, when the offset is 0 and the next skb is
      of length 0, it is only returned once.  The behaviour can be seen with
      the following python script:
      
      from socket import *;
      f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      f.bind(('::', 0));
      addr=('::1', f.getsockname()[1]);
      g.sendto(b'', addr)
      g.sendto(b'b', addr)
      print(f.recvfrom(10, MSG_PEEK));
      print(f.recvfrom(10, MSG_PEEK));
      
      Where the expected output should be the empty string twice.
      
      Instead, make sk_peek_offset return negative values, and pass those values
      to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
      to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
      __skb_try_recv_from_queue will then ensure the offset is reset back to 0
      if a peek is requested without an offset, unless no packets are found.
      
      Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
      greater than 0, and off is greater than or equal to skb->len, then
      (_off || skb->len) must always be true assuming skb->len >= 0 is always
      true.
      
      Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
      as it double checked if MSG_PEEK was set in the flags.
      
      V2:
       - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
      redundant checks
       - Fix peeking in udp{,v6}_recvmsg to report the right value when the
      offset is 0
      
      V3:
       - Marked new branch in __skb_try_recv_from_queue as unlikely.
      Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
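A simplified model of the skip rule described in the datagram commit above (a0917e0b). The dictionary-based queue and the names `try_peek` and `make_queue` are hypothetical. Only the rule that a negative offset never skips an skb is taken from the commit message; the handling of non-negative offsets is an assumption about the surrounding logic.

```python
def make_queue(*lengths):
    """Build a toy receive queue of skbs with the given payload lengths."""
    return [{"len": n, "peeked": False} for n in lengths]


def try_peek(queue, off):
    """Return the index of the skb a MSG_PEEK would see, or None."""
    peek_at_off = off >= 0            # negative means "no peek offset set"
    remaining = off if peek_at_off else 0
    for index, skb in enumerate(queue):
        # Only an explicit, non-negative offset may skip over skbs.
        if peek_at_off and remaining >= skb["len"] and (remaining or skb["peeked"]):
            remaining -= skb["len"]
            continue
        skb["peeked"] = True
        return index
    return None
```

In this model, peeking twice with a negative offset returns the empty datagram both times, which is the expected output of the Python script in the commit message. An explicit offset of 0 steps past the already-peeked empty skb on the second call.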
  10. 17 Aug, 2017 1 commit
    • E
      ipv4: better IP_MAX_MTU enforcement · c780a049
      Eric Dumazet committed
      While working on yet another syzkaller report, I found
      that our IP_MAX_MTU enforcements were not properly done.
      
      gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
      final result can be bigger than IP_MAX_MTU :/
      
      This is a problem because device mtu can be changed on other cpus or
      threads.
      
      While this patch does not fix the issue I am working on, it is
      probably worth addressing it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
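The reload problem described in the IP_MAX_MTU commit above (c780a049) can be modeled as reading a changing value twice. In this sketch the `Device` class hands out a different MTU on each read to stand in for a concurrent writer on another CPU; the class and function names are hypothetical, and taking a single snapshot is shown as one plausible way to avoid the problem, not necessarily the exact change in the patch.

```python
IP_MAX_MTU = 0xFFFF


class Device:
    """Each read of .mtu returns the next value, simulating a racing writer."""

    def __init__(self, values):
        self._values = iter(values)

    @property
    def mtu(self):
        return next(self._values)


def clamp_with_reload(dev):
    # Two separate loads, as a compiler may emit for min(dev->mtu, IP_MAX_MTU):
    # the comparison sees one value, the result uses another.
    return dev.mtu if dev.mtu < IP_MAX_MTU else IP_MAX_MTU


def clamp_with_snapshot(dev):
    mtu = dev.mtu                 # single load, then clamp the local copy
    return min(mtu, IP_MAX_MTU)
```

If the MTU changes from 1500 to 70000 between the two loads, the first function returns 70000, above the limit it was meant to enforce. The snapshot version can never exceed IP_MAX_MTU.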
  11. 16 Aug, 2017 2 commits
    • E
      ipv6: fix NULL dereference in ip6_route_dev_notify() · 12d94a80
      Eric Dumazet committed
      Based on a syzkaller report [1], I found that a per cpu allocation
      failure in snmp6_alloc_dev() would then lead to NULL dereference in
      ip6_route_dev_notify().
      
      It seems this is a very old bug, thus no Fixes tag in this submission.
      
      Let's add in6_dev_put_clear() helper, as we will probably use
      it elsewhere (once available/present in net-next)
      
      [1]
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 1 PID: 17294 Comm: syz-executor6 Not tainted 4.13.0-rc2+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff88019f456680 task.stack: ffff8801c6e58000
      RIP: 0010:__read_once_size include/linux/compiler.h:250 [inline]
      RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
      RIP: 0010:refcount_sub_and_test+0x7d/0x1b0 lib/refcount.c:178
      RSP: 0018:ffff8801c6e5f1b0 EFLAGS: 00010202
      RAX: 0000000000000037 RBX: dffffc0000000000 RCX: ffffc90005d25000
      RDX: ffff8801c6e5f218 RSI: ffffffff82342bbf RDI: 0000000000000001
      RBP: ffff8801c6e5f240 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff10038dcbe37
      R13: 0000000000000006 R14: 0000000000000001 R15: 00000000000001b8
      FS:  00007f21e0429700(0000) GS:ffff8801dc100000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001ddbc22000 CR3: 00000001d632b000 CR4: 00000000001426e0
      DR0: 0000000020000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Call Trace:
       refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
       in6_dev_put include/net/addrconf.h:335 [inline]
       ip6_route_dev_notify+0x1c9/0x4a0 net/ipv6/route.c:3732
       notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394 [inline]
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x51/0x90 net/core/dev.c:1678
       call_netdevice_notifiers net/core/dev.c:1694 [inline]
       rollback_registered_many+0x91c/0xe80 net/core/dev.c:7107
       rollback_registered+0x1be/0x3c0 net/core/dev.c:7149
       register_netdevice+0xbcd/0xee0 net/core/dev.c:7587
       register_netdev+0x1a/0x30 net/core/dev.c:7669
       loopback_net_init+0x76/0x160 drivers/net/loopback.c:214
       ops_init+0x10a/0x570 net/core/net_namespace.c:118
       setup_net+0x313/0x710 net/core/net_namespace.c:294
       copy_net_ns+0x27c/0x580 net/core/net_namespace.c:418
       create_new_namespaces+0x425/0x880 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
       SYSC_unshare kernel/fork.c:2347 [inline]
       SyS_unshare+0x653/0xfa0 kernel/fork.c:2297
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x4512c9
      RSP: 002b:00007f21e0428c08 EFLAGS: 00000216 ORIG_RAX: 0000000000000110
      RAX: ffffffffffffffda RBX: 0000000000718150 RCX: 00000000004512c9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000062020200
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004b973d
      R13: 00000000ffffffff R14: 000000002001d000 R15: 00000000000002dd
      Code: 50 2b 34 82 c7 00 f1 f1 f1 f1 c7 40 04 04 f2 f2 f2 c7 40 08 f3 f3
      f3 f3 e8 a1 43 39 ff 4c 89 f8 48 8b 95 70 ff ff ff 48 c1 e8 03 <0f> b6
      0c 18 4c 89 f8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f 85
      RIP: __read_once_size include/linux/compiler.h:250 [inline] RSP:
      ffff8801c6e5f1b0
      RIP: atomic_read arch/x86/include/asm/atomic.h:26 [inline] RSP:
      ffff8801c6e5f1b0
      RIP: refcount_sub_and_test+0x7d/0x1b0 lib/refcount.c:178 RSP:
      ffff8801c6e5f1b0
      ---[ end trace e441d046c6410d31 ]---
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      ipv6: fib: Provide offload indication using nexthop flags · fe400799
      Ido Schimmel committed
      IPv6 routes currently lack nexthop flags as in IPv4. This has several
      implications.
      
      In the forwarding path, it requires us to check the carrier state of the
      nexthop device and potentially ignore a linkdown route, instead of
      checking for RTNH_F_LINKDOWN.
      
      It also requires capable drivers to use the user facing IPv6-specific
      route flags to provide offload indication, instead of using the nexthop
      flags as in IPv4.
      
      Add nexthop flags to IPv6 routes in the 40 bytes hole and use it to
      provide offload indication instead of the RTF_OFFLOAD flag, which is
      removed while it's still not part of any official kernel release.
      
      In the near future we would like to use the field for the
      RTNH_F_{LINKDOWN,DEAD} flags, but this change is more involved and might
      not be ready in time for the current cycle.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 15 Aug, 2017 1 commit
  13. 12 Aug, 2017 5 commits
    • E
      udp: harden copy_linear_skb() · fd851ba9
      Eric Dumazet committed
      syzkaller got crashes with CONFIG_HARDENED_USERCOPY=y configs.
      
      The issue is that recvfrom() can be used with a user buffer of Z bytes
      and an SO_PEEK_OFF of X bytes on an skb of Y bytes, under the following
      condition:
      
      Z < X < Y
      
      kernel BUG at mm/usercopy.c:72!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 2917 Comm: syzkaller842281 Not tainted 4.13.0-rc3+ #16
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      task: ffff8801d2fa40c0 task.stack: ffff8801d1fe8000
      RIP: 0010:report_usercopy mm/usercopy.c:64 [inline]
      RIP: 0010:__check_object_size+0x3ad/0x500 mm/usercopy.c:264
      RSP: 0018:ffff8801d1fef8a8 EFLAGS: 00010286
      RAX: 0000000000000078 RBX: ffffffff847102c0 RCX: 0000000000000000
      RDX: 0000000000000078 RSI: 1ffff1003a3fded5 RDI: ffffed003a3fdf09
      RBP: ffff8801d1fef998 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d1ea480e
      R13: fffffffffffffffa R14: ffffffff84710280 R15: dffffc0000000000
      FS:  0000000001360880(0000) GS:ffff8801dc000000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000202ecfe4 CR3: 00000001d1ff8000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       check_object_size include/linux/thread_info.h:108 [inline]
       check_copy_size include/linux/thread_info.h:139 [inline]
       copy_to_iter include/linux/uio.h:105 [inline]
       copy_linear_skb include/net/udp.h:371 [inline]
       udpv6_recvmsg+0x1040/0x1af0 net/ipv6/udp.c:395
       inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793
       sock_recvmsg_nosec net/socket.c:792 [inline]
       sock_recvmsg+0xc9/0x110 net/socket.c:799
       SYSC_recvfrom+0x2d6/0x570 net/socket.c:1788
       SyS_recvfrom+0x40/0x50 net/socket.c:1760
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: b65ac446 ("udp: try to avoid 2 cache miss on dequeue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
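A small worked example of the Z < X < Y condition from the udp commit above (fd851ba9). It assumes, as an illustration only, that the unhardened copy length was derived as buffer length minus peek offset, and that a hardened version bounds the length by what remains in the skb. Both function names are hypothetical.

```python
U64_MASK = (1 << 64) - 1


def copy_len_unchecked(buf_len, peek_off):
    # Assumed pre-fix arithmetic: goes negative whenever Z < X.
    return buf_len - peek_off


def copy_len_checked(buf_len, peek_off, skb_len):
    # Reject offsets outside the skb, then copy no more than what is
    # left after the offset and no more than the user buffer holds.
    if peek_off < 0 or peek_off > skb_len:
        raise ValueError("peek offset outside the skb")
    return min(buf_len, skb_len - peek_off)


if __name__ == "__main__":
    z, x, y = 4, 10, 20  # user buffer, SO_PEEK_OFF, skb length
    bad = copy_len_unchecked(z, x)
    print(bad, hex(bad & U64_MASK))   # negative length seen as a huge size
    print(copy_len_checked(z, x, y))
```

A negative length reinterpreted as an unsigned size is enormous, which is the kind of request a hardened usercopy check rejects.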
    • D
      net: fix compilation when busy poll is not enabled · e4dde412
      Daniel Borkmann committed
      MIN_NAPI_ID is used in various places outside of
      CONFIG_NET_RX_BUSY_POLL wrapping, so when it's not set
      we run into build errors such as:
      
        net/core/dev.c: In function 'dev_get_by_napi_id':
        net/core/dev.c:886:16: error: ‘MIN_NAPI_ID’ undeclared (first use in this function)
          if (napi_id < MIN_NAPI_ID)
                        ^~~~~~~~~~~
      
      Thus, have MIN_NAPI_ID always defined to fix these errors.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bonding: require speed/duplex only for 802.3ad, alb and tlb · ad729bc9
      Andreas Born committed
      The patch c4adfc82 ("bonding: make speed, duplex setting consistent
      with link state") puts the link state to down if
      bond_update_speed_duplex() cannot retrieve speed and duplex settings.
      Presumably the patch was written with 802.3ad mode in mind, which relies
      on link speed/duplex settings. For other modes like active-backup these
      settings are not required. Thus, only for these other modes, this patch
      reintroduces support for slaves that do not support reporting speed or
      duplex such as wireless devices. This fixes the regression reported in
      bug 196547 (https://bugzilla.kernel.org/show_bug.cgi?id=196547).
      
      Fixes: c4adfc82 ("bonding: make speed, duplex setting consistent
      with link state")
      Signed-off-by: Andreas Born <futur.andy@googlemail.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      net: sched: remove cops->tcf_cl_offload · 7b06e8ae
      Jiri Pirko committed
      cops->tcf_cl_offload is no longer needed, as the drivers check what they
      can and cannot offload using the classid identify helpers. So remove this.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      net: sched: remove handle propagation down to the drivers · 237f79d2
      Jiri Pirko committed
      There is no longer a need to use the handle in drivers, so remove it from
      the tc_cls_common_offload struct.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>