1. 30 12月, 2017 4 次提交
    • E
      af_key: fix buffer overread in parse_exthdrs() · 4e765b49
      Eric Biggers 提交于
      If a message sent to a PF_KEY socket ended with an incomplete extension
      header (fewer than 4 bytes remaining), then parse_exthdrs() read past
      the end of the message, into uninitialized memory.  Fix it by returning
      -EINVAL in this case.
      
      Reproducer:
      
      	#include <linux/pfkeyv2.h>
      	#include <sys/socket.h>
      	#include <unistd.h>
      
      	int main()
      	{
      		int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2);
      		char buf[17] = { 0 };
      		struct sadb_msg *msg = (void *)buf;
      
      		msg->sadb_msg_version = PF_KEY_V2;
      		msg->sadb_msg_type = SADB_DELETE;
      		msg->sadb_msg_len = 2;
      
      		write(sock, buf, 17);
      	}
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      4e765b49
    • E
      af_key: fix buffer overread in verify_address_len() · 06b335cb
      Eric Biggers 提交于
      If a message sent to a PF_KEY socket ended with one of the extensions
      that takes a 'struct sadb_address' but there were not enough bytes
      remaining in the message for the ->sa_family member of the 'struct
      sockaddr' which is supposed to follow, then verify_address_len() read
      past the end of the message, into uninitialized memory.  Fix it by
      returning -EINVAL in this case.
      
      This bug was found using syzkaller with KMSAN.
      
      Reproducer:
      
      	#include <linux/pfkeyv2.h>
      	#include <sys/socket.h>
      	#include <unistd.h>
      
      	int main()
      	{
      		int sock = socket(PF_KEY, SOCK_RAW, PF_KEY_V2);
      		char buf[24] = { 0 };
      		struct sadb_msg *msg = (void *)buf;
      		struct sadb_address *addr = (void *)(msg + 1);
      
      		msg->sadb_msg_version = PF_KEY_V2;
      		msg->sadb_msg_type = SADB_DELETE;
      		msg->sadb_msg_len = 3;
      		addr->sadb_address_len = 1;
      		addr->sadb_address_exttype = SADB_EXT_ADDRESS_SRC;
      
      		write(sock, buf, 24);
      	}
      Reported-by: NAlexander Potapenko <glider@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      06b335cb
    • F
      xfrm: skip policies marked as dead while rehashing · 862591bf
      Florian Westphal 提交于
      syzkaller triggered following KASAN splat:
      
      BUG: KASAN: slab-out-of-bounds in xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618
      read of size 2 at addr ffff8801c8e92fe4 by task kworker/1:1/23 [..]
      Workqueue: events xfrm_hash_rebuild [..]
       __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428
       xfrm_hash_rebuild+0xdbe/0xf00 net/xfrm/xfrm_policy.c:618
       process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112
       worker_thread+0x223/0x1990 kernel/workqueue.c:2246 [..]
      
      The reproducer triggers:
      1016                 if (error) {
      1017                         list_move_tail(&walk->walk.all, &x->all);
      1018                         goto out;
      1019                 }
      
      in xfrm_policy_walk() via pfkey (it sets tiny rcv space, dump
      callback returns -ENOBUFS).
      
      In this case, *walk is located the pfkey socket struct, so this socket
      becomes visible in the global policy list.
      
      It looks like this is intentional -- phony walker has walk.dead set to 1
      and all other places skip such "policies".
      
      Ccing original authors of the two commits that seem to expose this
      issue (first patch missed ->dead check, second patch adds pfkey
      sockets to policies dumper list).
      
      Fixes: 880a6fab ("xfrm: configure policy hash table thresholds by netlink")
      Fixes: 12a169e7 ("ipsec: Put dumpers on the dump list")
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Timo Teras <timo.teras@iki.fi>
      Cc: Christophe Gouault <christophe.gouault@6wind.com>
      Reported-by: Nsyzbot <bot+c028095236fcb6f4348811565b75084c754dc729@syzkaller.appspotmail.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      862591bf
    • H
      xfrm: Forbid state updates from changing encap type · 257a4b01
      Herbert Xu 提交于
      Currently we allow state updates to competely replace the contents
      of x->encap.  This is bad because on the user side ESP only sets up
      header lengths depending on encap_type once when the state is first
      created.  This could result in the header lengths getting out of
      sync with the actual state configuration.
      
      In practice key managers will never do a state update to change the
      encapsulation type.  Only the port numbers need to be changed as the
      peer NAT entry is updated.
      
      Therefore this patch adds a check in xfrm_state_update to forbid
      any changes to the encap_type.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      257a4b01
  2. 29 12月, 2017 3 次提交
  3. 28 12月, 2017 1 次提交
  4. 27 12月, 2017 8 次提交
    • T
      tipc: fix tipc_mon_delete() oops in tipc_enable_bearer() error path · 642a8439
      Tommi Rantala 提交于
      Calling tipc_mon_delete() before the monitor has been created will oops.
      This can happen in tipc_enable_bearer() error path if tipc_disc_create()
      fails.
      
      [   48.589074] BUG: unable to handle kernel paging request at 0000000000001008
      [   48.590266] IP: tipc_mon_delete+0xea/0x270 [tipc]
      [   48.591223] PGD 1e60c5067 P4D 1e60c5067 PUD 1eb0cf067 PMD 0
      [   48.592230] Oops: 0000 [#1] SMP KASAN
      [   48.595610] CPU: 5 PID: 1199 Comm: tipc Tainted: G    B            4.15.0-rc4-pc64-dirty #5
      [   48.597176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
      [   48.598489] RIP: 0010:tipc_mon_delete+0xea/0x270 [tipc]
      [   48.599347] RSP: 0018:ffff8801d827f668 EFLAGS: 00010282
      [   48.600705] RAX: ffff8801ee813f00 RBX: 0000000000000204 RCX: 0000000000000000
      [   48.602183] RDX: 1ffffffff1de6a75 RSI: 0000000000000297 RDI: 0000000000000297
      [   48.604373] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff1dd1533
      [   48.605607] R10: ffffffff8eafbb05 R11: fffffbfff1dd1534 R12: 0000000000000050
      [   48.607082] R13: dead000000000200 R14: ffffffff8e73f310 R15: 0000000000001020
      [   48.608228] FS:  00007fc686484800(0000) GS:ffff8801f5540000(0000) knlGS:0000000000000000
      [   48.610189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   48.611459] CR2: 0000000000001008 CR3: 00000001dda70002 CR4: 00000000003606e0
      [   48.612759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   48.613831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   48.615038] Call Trace:
      [   48.615635]  tipc_enable_bearer+0x415/0x5e0 [tipc]
      [   48.620623]  tipc_nl_bearer_enable+0x1ab/0x200 [tipc]
      [   48.625118]  genl_family_rcv_msg+0x36b/0x570
      [   48.631233]  genl_rcv_msg+0x5a/0xa0
      [   48.631867]  netlink_rcv_skb+0x1cc/0x220
      [   48.636373]  genl_rcv+0x24/0x40
      [   48.637306]  netlink_unicast+0x29c/0x350
      [   48.639664]  netlink_sendmsg+0x439/0x590
      [   48.642014]  SYSC_sendto+0x199/0x250
      [   48.649912]  do_syscall_64+0xfd/0x2c0
      [   48.650651]  entry_SYSCALL64_slow_path+0x25/0x25
      [   48.651843] RIP: 0033:0x7fc6859848e3
      [   48.652539] RSP: 002b:00007ffd25dff938 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [   48.654003] RAX: ffffffffffffffda RBX: 00007ffd25dff990 RCX: 00007fc6859848e3
      [   48.655303] RDX: 0000000000000054 RSI: 00007ffd25dff990 RDI: 0000000000000003
      [   48.656512] RBP: 00007ffd25dff980 R08: 00007fc685c35fc0 R09: 000000000000000c
      [   48.657697] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000d13010
      [   48.658840] R13: 00007ffd25e009c0 R14: 0000000000000000 R15: 0000000000000000
      [   48.662972] RIP: tipc_mon_delete+0xea/0x270 [tipc] RSP: ffff8801d827f668
      [   48.664073] CR2: 0000000000001008
      [   48.664576] ---[ end trace e811818d54d5ce88 ]---
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTommi Rantala <tommi.t.rantala@nokia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      642a8439
    • T
      tipc: error path leak fixes in tipc_enable_bearer() · 19142551
      Tommi Rantala 提交于
      Fix memory leak in tipc_enable_bearer() if enable_media() fails, and
      cleanup with bearer_disable() if tipc_mon_create() fails.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NTommi Rantala <tommi.t.rantala@nokia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19142551
    • A
      RDS: Check cmsg_len before dereferencing CMSG_DATA · 14e138a8
      Avinash Repaka 提交于
      RDS currently doesn't check if the length of the control message is
      large enough to hold the required data, before dereferencing the control
      message data. This results in following crash:
      
      BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013
      [inline]
      BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90
      net/rds/send.c:1066
      Read of size 8 at addr ffff8801c928fb70 by task syzkaller455006/3157
      
      CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:252
       kasan_report_error mm/kasan/report.c:351 [inline]
       kasan_report+0x25b/0x340 mm/kasan/report.c:409
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
       rds_rdma_bytes net/rds/send.c:1013 [inline]
       rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066
       sock_sendmsg_nosec net/socket.c:628 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:638
       ___sys_sendmsg+0x320/0x8b0 net/socket.c:2018
       __sys_sendmmsg+0x1ee/0x620 net/socket.c:2108
       SYSC_sendmmsg net/socket.c:2139 [inline]
       SyS_sendmmsg+0x35/0x60 net/socket.c:2134
       entry_SYSCALL_64_fastpath+0x1f/0x96
      RIP: 0033:0x43fe49
      RSP: 002b:00007fffbe244ad8 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe49
      RDX: 0000000000000001 RSI: 000000002020c000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004017b0
      R13: 0000000000401840 R14: 0000000000000000 R15: 0000000000000000
      
      To fix this, we verify that the cmsg_len is large enough to hold the
      data to be read, before proceeding further.
      Reported-by: Nsyzbot <syzkaller-bugs@googlegroups.com>
      Signed-off-by: NAvinash Repaka <avinash.repaka@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Reviewed-by: NYuval Shaia <yuval.shaia@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14e138a8
    • J
      tipc: fix memory leak of group member when peer node is lost · 3a33a19b
      Jon Maloy 提交于
      When a group member receives a member WITHDRAW event, this might have
      two reasons: either the peer member is leaving the group, or the link
      to the member's node has been lost.
      
      In the latter case we need to issue a DOWN event to the user right away,
      and let function tipc_group_filter_msg() perform delete of the member
      item. However, in this case we miss to change the state of the member
      item to MBR_LEAVING, so the member item is not deleted, and we have a
      memory leak.
      
      We now separate better between the four sub-cases of a WITHRAW event
      and make sure that each case is handled correctly.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a33a19b
    • J
      net: sched: fix possible null pointer deref in tcf_block_put · 4853f128
      Jiri Pirko 提交于
      We need to check block for being null in both tcf_block_put and
      tcf_block_put_ext.
      
      Fixes: 343723dd ("net: sched: fix clsact init error path")
      Reported-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4853f128
    • J
      tipc: base group replicast ack counter on number of actual receivers · 0a3d805c
      Jon Maloy 提交于
      In commit 2f487712 ("tipc: guarantee that group broadcast doesn't
      bypass group unicast") we introduced a mechanism that requires the first
      (replicated) broadcast sent after a unicast to be acknowledged by all
      receivers before permitting sending of the next (true) broadcast.
      
      The counter for keeping track of the number of acknowledges to expect
      is based on the tipc_group::member_cnt variable. But this misses that
      some of the known members may not be ready for reception, and will never
      acknowledge the message, either because they haven't fully joined the
      group or because they are leaving the group. Such members are identified
      by not fulfilling the condition tested for in the function
      tipc_group_is_enabled().
      
      We now set the counter for the actual number of acks to receive at the
      moment the message is sent, by just counting the number of recipients
      satisfying the tipc_group_is_enabled() test.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a3d805c
    • C
      net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap() · b2fb01f4
      Cong Wang 提交于
      The rcu_barrier_bh() in mini_qdisc_pair_swap() is to wait for
      flying RCU callback installed by a previous mini_qdisc_pair_swap(),
      however we miss it on the tp_head==NULL path, which leads to that
      the RCU callback still uses miniq_old->rcu after it is freed together
      with qdisc in qdisc_graft(). So just add it on that path too.
      
      Fixes: 46209401 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath ")
      Reported-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Tested-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2fb01f4
    • A
      ip6_gre: fix device features for ioctl setup · e5a9336a
      Alexey Kodanev 提交于
      When ip6gre is created using ioctl, its features, such as
      scatter-gather, GSO and tx-checksumming will be turned off:
      
        # ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1
        # ethtool -k gre6 (truncated output)
          tx-checksumming: off
          scatter-gather: off
          tcp-segmentation-offload: off
          generic-segmentation-offload: off [requested on]
      
      But when netlink is used, they will be enabled:
        # ip link add gre6 type ip6gre remote fd00::1
        # ethtool -k gre6 (truncated output)
          tx-checksumming: on
          scatter-gather: on
          tcp-segmentation-offload: on
          generic-segmentation-offload: on
      
      This results in a loss of performance when gre6 is created via ioctl.
      The issue was found with LTP/gre tests.
      
      Fix it by moving the setup of device features to a separate function
      and invoke it with ndo_init callback because both netlink and ioctl
      will eventually call it via register_netdevice():
      
         register_netdevice()
             - ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init()
                 - ip6gre_tunnel_init_common()
                      - ip6gre_tnl_init_features()
      
      The moved code also contains two minor style fixes:
        * removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line.
        * fixed the issue reported by checkpatch: "Unnecessary parentheses around
          'nt->encap.type == TUNNEL_ENCAP_NONE'"
      
      Fixes: ac4eb009 ("ip6gre: Add support for basic offloads offloads excluding GSO")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e5a9336a
  5. 22 12月, 2017 5 次提交
    • W
      skbuff: skb_copy_ubufs must release uarg even without user frags · b90ddd56
      Willem de Bruijn 提交于
      skb_copy_ubufs creates a private copy of frags[] to release its hold
      on user frags, then calls uarg->callback to notify the owner.
      
      Call uarg->callback even when no frags exist. This edge case can
      happen when zerocopy_sg_from_iter finds enough room in skb_headlen
      to copy all the data.
      
      Fixes: 3ece7826 ("sock: skb_copy_ubufs support for compound pages")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b90ddd56
    • W
      skbuff: orphan frags before zerocopy clone · 268b7906
      Willem de Bruijn 提交于
      Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate
      calls to skb_uarg(skb)->callback for the same data.
      
      skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb
      with each segment. This is only safe for uargs that do refcounting,
      which is those that pass skb_orphan_frags without dropping their
      shared frags. For others, skb_orphan_frags drops the user frags and
      sets the uarg to NULL, after which sock_zerocopy_clone has no effect.
      
      Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback
      calls for the same data causing the vhost_net_ubuf_ref_>refcount to
      drop below zero.
      
      Link: http://lkml.kernel.org/r/<CAF=yD-LWyCD4Y0aJ9O0e_CHLR+3JOeKicRRTEVCPxgw4XOcqGQ@mail.gmail.com>
      Fixes: 1f8b977a ("sock: enable MSG_ZEROCOPY")
      Reported-by: NAndreas Hartmann <andihartmann@01019freenet.de>
      Reported-by: NDavid Hill <dhill@redhat.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      268b7906
    • S
      net: reevalulate autoflowlabel setting after sysctl setting · 513674b5
      Shaohua Li 提交于
      sysctl.ip6.auto_flowlabels is default 1. In our hosts, we set it to 2.
      If sockopt doesn't set autoflowlabel, outcome packets from the hosts are
      supposed to not include flowlabel. This is true for normal packet, but
      not for reset packet.
      
      The reason is ipv6_pinfo.autoflowlabel is set in sock creation. Later if
      we change sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't
      changed, so the sock will keep the old behavior in terms of auto
      flowlabel. Reset packet is suffering from this problem, because reset
      packet is sent from a special control socket, which is created at boot
      time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
      socket will always have its ipv6_pinfo.autoflowlabel set, even after
      user set sysctl.ipv6.auto_flowlabels to 1, so reset packset will always
      have flowlabel. Normal sock created before sysctl setting suffers from
      the same issue. We can't even turn off autoflowlabel unless we kill all
      socks in the hosts.
      
      To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
      autoflowlabel setting from user, otherwise we always call
      ip6_default_np_autolabel() which has the new settings of sysctl.
      
      Note, this changes behavior a little bit. Before commit 42240901
      (ipv6: Implement different admin modes for automatic flow labels), the
      autoflowlabel behavior of a sock isn't sticky, eg, if sysctl changes,
      existing connection will change autoflowlabel behavior. After that
      commit, autoflowlabel behavior is sticky in the whole life of the sock.
      With this patch, the behavior isn't sticky again.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      513674b5
    • E
      openvswitch: Fix pop_vlan action for double tagged frames · c48e7473
      Eric Garver 提交于
      skb_vlan_pop() expects skb->protocol to be a valid TPID for double
      tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop()
      shift the true ethertype into position for us.
      
      Fixes: 5108bbad ("openvswitch: add processing of L3 packets")
      Signed-off-by: NEric Garver <e@erig.me>
      Reviewed-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c48e7473
    • I
      ipv6: Honor specified parameters in fibmatch lookup · 58acfd71
      Ido Schimmel 提交于
      Currently, parameters such as oif and source address are not taken into
      account during fibmatch lookup. Example (IPv4 for reference) before
      patch:
      
      $ ip -4 route show
      192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
      198.51.100.0/24 dev dummy1 proto kernel scope link src 198.51.100.1
      
      $ ip -6 route show
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      2001:db8:2::/64 dev dummy1 proto kernel metric 256 pref medium
      fe80::/64 dev dummy0 proto kernel metric 256 pref medium
      fe80::/64 dev dummy1 proto kernel metric 256 pref medium
      
      $ ip -4 route get fibmatch 192.0.2.2 oif dummy0
      192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
      $ ip -4 route get fibmatch 192.0.2.2 oif dummy1
      RTNETLINK answers: No route to host
      
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      
      After:
      
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
      2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
      $ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
      RTNETLINK answers: Network is unreachable
      
      The problem stems from the fact that the necessary route lookup flags
      are not set based on these parameters.
      
      Instead of duplicating the same logic for fibmatch, we can simply
      resolve the original route from its copy and dump it instead.
      
      Fixes: 18c3a61c ("net: ipv6: RTM_GETROUTE: return matched fib result when requested")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      58acfd71
  6. 21 12月, 2017 4 次提交
    • I
      ipv4: Fix use-after-free when flushing FIB tables · b4681c28
      Ido Schimmel 提交于
      Since commit 0ddcf43d ("ipv4: FIB Local/MAIN table collapse") the
      local table uses the same trie allocated for the main table when custom
      rules are not in use.
      
      When a net namespace is dismantled, the main table is flushed and freed
      (via an RCU callback) before the local table. In case the callback is
      invoked before the local table is iterated, a use-after-free can occur.
      
      Fix this by iterating over the FIB tables in reverse order, so that the
      main table is always freed after the local table.
      
      v3: Reworded comment according to Alex's suggestion.
      v2: Add a comment to make the fix more explicit per Dave's and Alex's
      feedback.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4681c28
    • J
      tipc: remove joining group member from congested list · bb25c385
      Jon Maloy 提交于
      When we receive a JOIN message from a peer member, the message may
      contain an advertised window value ADV_IDLE that permits removing the
      member in question from the tipc_group::congested list. However, since
      the removal has been made conditional on that the advertised window is
      *not* ADV_IDLE, we miss this case. This has the effect that a sender
      sometimes may enter a state of permanent, false, broadcast congestion.
      
      We fix this by unconditinally removing the member from the congested
      list before calling tipc_member_update(), which might potentially sort
      it into the list again.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bb25c385
    • J
      cls_bpf: fix offload assumptions after callback conversion · 102740bd
      Jakub Kicinski 提交于
      cls_bpf used to take care of tracking what offload state a filter
      is in, i.e. it would track if offload request succeeded or not.
      This information would then be used to issue correct requests to
      the driver, e.g. requests for statistics only on offloaded filters,
      removing only filters which were offloaded, using add instead of
      replace if previous filter was not added etc.
      
      This tracking of offload state no longer functions with the new
      callback infrastructure.  There could be multiple entities trying
      to offload the same filter.
      
      Throw out all the tracking and corresponding commands and simply
      pass to the drivers both old and new bpf program.  Drivers will
      have to deal with offload state tracking by themselves.
      
      Fixes: 3f7889c4 ("net: sched: cls_bpf: call block callbacks for offload")
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      102740bd
    • E
      net: Fix double free and memory corruption in get_net_ns_by_id() · 21b59443
      Eric W. Biederman 提交于
      (I can trivially verify that that idr_remove in cleanup_net happens
       after the network namespace count has dropped to zero --EWB)
      
      Function get_net_ns_by_id() does not check for net::count
      after it has found a peer in netns_ids idr.
      
      It may dereference a peer, after its count has already been
      finaly decremented. This leads to double free and memory
      corruption:
      
      put_net(peer)                                   rtnl_lock()
      atomic_dec_and_test(&peer->count) [count=0]     ...
      __put_net(peer)                                 get_net_ns_by_id(net, id)
        spin_lock(&cleanup_list_lock)
        list_add(&net->cleanup_list, &cleanup_list)
        spin_unlock(&cleanup_list_lock)
      queue_work()                                      peer = idr_find(&net->netns_ids, id)
        |                                               get_net(peer) [count=1]
        |                                               ...
        |                                               (use after final put)
        v                                               ...
        cleanup_net()                                   ...
          spin_lock(&cleanup_list_lock)                 ...
          list_replace_init(&cleanup_list, ..)          ...
          spin_unlock(&cleanup_list_lock)               ...
          ...                                           ...
          ...                                           put_net(peer)
          ...                                             atomic_dec_and_test(&peer->count) [count=0]
          ...                                               spin_lock(&cleanup_list_lock)
          ...                                               list_add(&net->cleanup_list, &cleanup_list)
          ...                                               spin_unlock(&cleanup_list_lock)
          ...                                             queue_work()
          ...                                           rtnl_unlock()
          rtnl_lock()                                   ...
          for_each_net(tmp) {                           ...
            id = __peernet2id(tmp, peer)                ...
            spin_lock_irq(&tmp->nsid_lock)              ...
            idr_remove(&tmp->netns_ids, id)             ...
            ...                                         ...
            net_drop_ns()                               ...
      	net_free(peer)                            ...
          }                                             ...
        |
        v
        cleanup_net()
          ...
          (Second free of peer)
      
      Also, put_net() on the right cpu may reorder with left's cpu
      list_replace_init(&cleanup_list, ..), and then cleanup_list
      will be corrupted.
      
      Since cleanup_net() is executed in worker thread, while
      put_net(peer) can happen everywhere, there should be
      enough time for concurrent get_net_ns_by_id() to pick
      the peer up, and the race does not seem to be unlikely.
      The patch fixes the problem in standard way.
      
      (Also, there is possible problem in peernet2id_alloc(), which requires
      check for net::count under nsid_lock and maybe_get_net(peer), but
      in current stable kernel it's used under rtnl_lock() and it has to be
      safe. Openswitch begun to use peernet2id_alloc(), and possibly it should
      be fixed too. While this is not in stable kernel yet, so I'll send
      a separate message to netdev@ later).
      
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Fixes: 0c7aecd4 "netns: add rtnl cmd to add and get peer netns ids"
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21b59443
  7. 20 12月, 2017 5 次提交
    • P
      ipv4: fib: Fix metrics match when deleting a route · d03a4557
      Phil Sutter 提交于
      The recently added fib_metrics_match() causes a regression for routes
      with both RTAX_FEATURES and RTAX_CC_ALGO if the latter has
      TCP_CONG_NEEDS_ECN flag set:
      
      | # ip link add d0 type dummy
      | # ip link set d0 up
      | # ip route add 172.29.29.0/24 dev d0 features ecn congctl dctcp
      | # ip route del 172.29.29.0/24 dev d0 features ecn congctl dctcp
      | RTNETLINK answers: No such process
      
      During route insertion, fib_convert_metrics() detects that the given CC
      algo requires ECN and hence sets DST_FEATURE_ECN_CA bit in
      RTAX_FEATURES.
      
      During route deletion though, fib_metrics_match() compares stored
      RTAX_FEATURES value with that from userspace (which obviously has no
      knowledge about DST_FEATURE_ECN_CA) and fails.
      
      Fixes: 5f9ae3d9 ("ipv4: do metrics match when looking up and deleting a route")
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d03a4557
    • J
      tipc: fix list sorting bug in function tipc_group_update_member() · 3db09601
      Jon Maloy 提交于
      When, during a join operation, or during message transmission, a group
      member needs to be added to the group's 'congested' list, we sort it
      into the list in ascending order, according to its current advertised
      window size. However, we miss the case when the member is already on
      that list. This will have the result that the member, after the window
      size has been decremented, might be at the wrong position in that list.
      This again may have the effect that we during broadcast and multicast
      transmissions miss the fact that a destination is not yet ready for
      reception, and we end up sending anyway. From this point on, the
      behavior during the remaining session is unpredictable, e.g., with
      underflowing window sizes.
      
      We now correct this bug by unconditionally removing the member from
      the list before (re-)sorting it in.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3db09601
    • X
      ip6_tunnel: get the min mtu properly in ip6_tnl_xmit · c9fefa08
      Xin Long 提交于
      Now it's using IPV6_MIN_MTU as the min mtu in ip6_tnl_xmit, but
      IPV6_MIN_MTU actually only works when the inner packet is ipv6.
      
      With IPV6_MIN_MTU for ipv4 packets, the new pmtu for inner dst
      couldn't be set less than 1280. It would cause tx_err and the
      packet to be dropped when the outer dst pmtu is close to 1280.
      
      Jianlin found it by running ipv4 traffic with the topo:
      
        (client) gre6 <---> eth1 (route) eth2 <---> gre6 (server)
      
      After changing eth2 mtu to 1300, the performance became very
      low, or the connection was even broken. The issue also affects
      ip4ip6 and ip6ip6 tunnels.
      
      So if the inner packet is ipv4, 576 should be considered as the
      min mtu.
      
      Note that for ip4ip6 and ip6ip6 tunnels, the inner packet can
      only be ipv4 or ipv6, but for gre6 tunnel, it may also be ARP.
      This patch using 576 as the min mtu for non-ipv6 packet works
      for all those cases.
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9fefa08
    • X
      ip6_gre: remove the incorrect mtu limit for ipgre tap · 2c52129a
      Xin Long 提交于
      The same fix as the patch "ip_gre: remove the incorrect mtu limit for
      ipgre tap" is also needed for ip6_gre.
      
      Fixes: 61e84623 ("net: centralize net_device min/max MTU checking")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c52129a
    • X
      ip_gre: remove the incorrect mtu limit for ipgre tap · cfddd4c3
      Xin Long 提交于
      ipgre tap driver calls ether_setup(), after commit 61e84623
      ("net: centralize net_device min/max MTU checking"), the range
      of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
      
      It causes the dev mtu of the ipgre tap device to not be greater
      than 1500, this limit value is not correct for ipgre tap device.
      
      Besides, it's .change_mtu already does the right check. So this
      patch is just to set max_mtu as 0, and leave the check to it's
      .change_mtu.
      
      Fixes: 61e84623 ("net: centralize net_device min/max MTU checking")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfddd4c3
  8. 19 12月, 2017 8 次提交
    • J
      cfg80211: ship certificates as hex files · 04a7279f
      Johannes Berg 提交于
      Not only does this remove the need for the hexdump code in most
      normal kernel builds (still there for the extra directory), but
      it also removes the need to ship binary files, which apparently
      is somewhat problematic, as Randy reported.
      
      While at it, also add the generated files to clean-files.
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      04a7279f
    • T
      cfg80211: always rewrite generated files from scratch · 5d324073
      Thierry Reding 提交于
      Currently the certs C code generation appends to the generated files,
      which is most likely a leftover from commit 715a1233 ("wireless:
      don't write C files on failures"). This causes duplicate code in the
      generated files if the certificates have their timestamps modified
      between builds and thereby trigger the generation rules.
      
      Fixes: 715a1233 ("wireless: don't write C files on failures")
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      5d324073
    • H
      xfrm: Reinject transport-mode packets through tasklet · acf568ee
      Herbert Xu 提交于
      This is an old bugbear of mine:
      
      https://www.mail-archive.com/netdev@vger.kernel.org/msg03894.html
      
      By crafting special packets, it is possible to cause recursion
      in our kernel when processing transport-mode packets at levels
      that are only limited by packet size.
      
      The easiest one is with DNAT, but an even worse one is where
      UDP encapsulation is used in which case you just have to insert
      an UDP encapsulation header in between each level of recursion.
      
      This patch avoids this problem by reinjecting tranport-mode packets
      through a tasklet.
      
      Fixes: b05e1066 ("[IPV4/6]: Netfilter IPsec input hooks")
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      acf568ee
    • N
      net: bridge: fix early call to br_stp_change_bridge_id and plug newlink leaks · 84aeb437
      Nikolay Aleksandrov 提交于
      The early call to br_stp_change_bridge_id in bridge's newlink can cause
      a memory leak if an error occurs during the newlink because the fdb
      entries are not cleaned up if a different lladdr was specified, also
      another minor issue is that it generates fdb notifications with
      ifindex = 0. Another unrelated memory leak is the bridge sysfs entries
      which get added on NETDEV_REGISTER event, but are not cleaned up in the
      newlink error path. To remove this special case the call to
      br_stp_change_bridge_id is done after netdev register and we cleanup the
      bridge on changelink error via br_dev_delete to plug all leaks.
      
      This patch makes netlink bridge destruction on newlink error the same as
      dellink and ioctl del which is necessary since at that point we have a
      fully initialized bridge device.
      
      To reproduce the issue:
      $ ip l add br0 address 00:11:22:33:44:55 type bridge group_fwd_mask 1
      RTNETLINK answers: Invalid argument
      
      $ rmmod bridge
      [ 1822.142525] =============================================================================
      [ 1822.143640] BUG bridge_fdb_cache (Tainted: G           O    ): Objects remaining in bridge_fdb_cache on __kmem_cache_shutdown()
      [ 1822.144821] -----------------------------------------------------------------------------
      
      [ 1822.145990] Disabling lock debugging due to kernel taint
      [ 1822.146732] INFO: Slab 0x0000000092a844b2 objects=32 used=2 fp=0x00000000fef011b0 flags=0x1ffff8000000100
      [ 1822.147700] CPU: 2 PID: 13584 Comm: rmmod Tainted: G    B      O     4.15.0-rc2+ #87
      [ 1822.148578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 1822.150008] Call Trace:
      [ 1822.150510]  dump_stack+0x78/0xa9
      [ 1822.151156]  slab_err+0xb1/0xd3
      [ 1822.151834]  ? __kmalloc+0x1bb/0x1ce
      [ 1822.152546]  __kmem_cache_shutdown+0x151/0x28b
      [ 1822.153395]  shutdown_cache+0x13/0x144
      [ 1822.154126]  kmem_cache_destroy+0x1c0/0x1fb
      [ 1822.154669]  SyS_delete_module+0x194/0x244
      [ 1822.155199]  ? trace_hardirqs_on_thunk+0x1a/0x1c
      [ 1822.155773]  entry_SYSCALL_64_fastpath+0x23/0x9a
      [ 1822.156343] RIP: 0033:0x7f929bd38b17
      [ 1822.156859] RSP: 002b:00007ffd160e9a98 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0
      [ 1822.157728] RAX: ffffffffffffffda RBX: 00005578316ba090 RCX: 00007f929bd38b17
      [ 1822.158422] RDX: 00007f929bd9ec60 RSI: 0000000000000800 RDI: 00005578316ba0f0
      [ 1822.159114] RBP: 0000000000000003 R08: 00007f929bff5f20 R09: 00007ffd160e8a11
      [ 1822.159808] R10: 00007ffd160e9860 R11: 0000000000000202 R12: 00007ffd160e8a80
      [ 1822.160513] R13: 0000000000000000 R14: 0000000000000000 R15: 00005578316ba090
      [ 1822.161278] INFO: Object 0x000000007645de29 @offset=0
      [ 1822.161666] INFO: Object 0x00000000d5df2ab5 @offset=128
      
      Fixes: 30313a3d ("bridge: Handle IFLA_ADDRESS correctly when creating bridge device")
      Fixes: 5b8d5429 ("bridge: netlink: register netdevice before executing changelink")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84aeb437
    • X
      sctp: add SCTP_CID_RECONF conversion in sctp_cname · d1969759
      Xin Long 提交于
      Whenever a new type of chunk is added, the corresp conversion in
      sctp_cname should be added. Otherwise, in some places, pr_debug
      will print it as "unknown chunk".
      
      Fixes: cc16f00f ("sctp: add support for generating stream reconf ssn reset request chunk")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1969759
    • X
      sctp: fix the issue that a __u16 variable may overflow in sctp_ulpq_renege · 5c468674
      Xin Long 提交于
      Now when reneging events in sctp_ulpq_renege(), the variable freed
      could be increased by a __u16 value twice while freed is of __u16
      type. It means freed may overflow at the second addition.
      
      This patch is to fix it by using __u32 type for 'freed', while at
      it, also to remove 'if (chunk)' check, as all renege commands are
      generated in sctp_eat_data and it can't be NULL.
      Reported-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c468674
    • J
      tipc: remove leaving group member from all lists · 3f42f5fe
      Jon Maloy 提交于
      A group member going into state LEAVING should never go back to any
      other state before it is finally deleted. However, this might happen
      if the socket needs to send out a RECLAIM message during this interval.
      Since we forget to remove the leaving member from the group's 'active'
      or 'pending' list, the member might be selected for reclaiming, change
      state to RECLAIMING, and get stuck in this state instead of being
      deleted. This might lead to suppression of the expected 'member down'
      event to the receiver.
      
      We fix this by removing the member from all lists, except the RB tree,
      at the moment it goes into state LEAVING.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f42f5fe
    • J
      tipc: fix lost member events bug · 23483399
      Jon Maloy 提交于
      Group messages are not supposed to be returned to sender when the
      destination socket disappears. This is done correctly for regular
      traffic messages, by setting the 'dest_droppable' bit in the header.
      But we forget to do that in group protocol messages. This has the effect
      that such messages may sometimes bounce back to the sender, be perceived
      as a legitimate peer message, and wreak general havoc for the rest of
      the session. In particular, we have seen that a member in state LEAVING
      may go back to state RECLAIMED or REMITTED, hence causing suppression
      of an otherwise expected 'member down' event to the user.
      
      We fix this by setting the 'dest_droppable' bit even in group protocol
      messages.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23483399
  9. 17 12月, 2017 1 次提交
    • B
      ipv6: icmp6: Allow icmp messages to be looped back · 588753f1
      Brendan McGrath 提交于
      One example of when an ICMPv6 packet is required to be looped back is
      when a host acts as both a Multicast Listener and a Multicast Router.
      
      A Multicast Router will listen on address ff02::16 for MLDv2 messages.
      
      Currently, MLDv2 messages originating from a Multicast Listener running
      on the same host as the Multicast Router are not being delivered to the
      Multicast Router. This is due to dst.input being assigned the default
      value of dst_discard.
      
      This results in the packet being looped back but discarded before being
      delivered to the Multicast Router.
      
      This patch sets dst.input to ip6_input to ensure a looped back packet
      is delivered to the Multicast Router.
      Signed-off-by: NBrendan McGrath <redmcg@redmandi.dyndns.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      588753f1
  10. 16 12月, 2017 1 次提交