1. 10 11月, 2016 5 次提交
    • D
      net: tcp response should set oif only if it is L3 master · 9b6c14d5
      David Ahern 提交于
      Lorenzo noted an Android unit test failed due to e0d56fdd:
      "The expectation in the test was that the RST replying to a SYN sent to a
      closed port should be generated with oif=0. In other words it should not
      prefer the interface where the SYN came in on, but instead should follow
      whatever the routing table says it should do."
      
      Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
      that the oif in the flow is set to the skb_iif only if skb_iif is an L3
      master.
      
      Fixes: e0d56fdd ("net: l3mdev: remove redundant calls")
      Reported-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b6c14d5
    • M
      rtnl: reset calcit fptr in rtnl_unregister() · f567e950
      Mathias Krause 提交于
      To avoid having dangling function pointers left behind, reset calcit in
      rtnl_unregister(), too.
      
      This is no issue so far, as only the rtnl core registers a netlink
      handler with a calcit hook which won't be unregistered, but may become
      one if new code makes use of the calcit hook.
      
      Fixes: c7ac8679 ("rtnetlink: Compute and store minimum ifinfo...")
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f567e950
    • D
      net: icmp_route_lookup should use rt dev to determine L3 domain · 9d1a6c4e
      David Ahern 提交于
      icmp_send is called in response to some event. The skb may not have
      the device set (skb->dev is NULL), but it is expected to have an rt.
      Update icmp_route_lookup to use the rt on the skb to determine L3
      domain.
      
      Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d1a6c4e
    • M
      net-ipv6: on device mtu change do not add mtu to mtu-less routes · fb56be83
      Maciej Żenczykowski 提交于
      Routes can specify an mtu explicitly or inherit the mtu from
      the underlying device - this inheritance is implemented in
      dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().
      
      Currently changing the mtu of a device adds mtu explicitly
      to routes using that device.
      
      ie.
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65535 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
      
      After this patch the route entry no longer changes unless it already has an mtu.
      There is no need: this inheritance is already done in ip6_mtu()
      
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route add local 2000::2 dev lo mtu 2000
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 1501
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 1501 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
        # ip -6 route del local 2000::2
      
      This is desirable because changing device mtu and then resetting it
      to the previous value shouldn't change the user visible routing table.
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      CC: Eric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb56be83
    • S
      sock: fix sendmmsg for partial sendmsg · 3023898b
      Soheil Hassas Yeganeh 提交于
      Do not send the next message in sendmmsg for partial sendmsg
      invocations.
      
      sendmmsg assumes that it can continue sending the next message
      when the return value of the individual sendmsg invocations
      is positive. It results in corrupting the data for TCP,
      SCTP, and UNIX streams.
      
      For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
      of "aefgh" if the first sendmsg invocation sends only the first
      byte while the second sendmsg goes through.
      
      Datagram sockets either send the entire datagram or fail, so
      this patch affects only sockets of type SOCK_STREAM and
      SOCK_SEQPACKET.
      
      Fixes: 228e548e ("net: Add sendmmsg socket system call")
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3023898b
  2. 09 11月, 2016 5 次提交
    • L
      netfilter: nf_tables: fix oops when inserting an element into a verdict map · 58c78e10
      Liping Zhang 提交于
      Dalegaard says:
       The following ruleset, when loaded with 'nft -f bad.txt'
       ----snip----
       flush ruleset
       table ip inlinenat {
         map sourcemap {
           type ipv4_addr : verdict;
         }
      
         chain postrouting {
           ip saddr vmap @sourcemap accept
         }
       }
       add chain inlinenat test
       add element inlinenat sourcemap { 100.123.10.2 : jump test }
       ----snip----
      
       results in a kernel oops:
       BUG: unable to handle kernel paging request at 0000000000001344
       IP: [<ffffffffa07bf704>] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
       [...]
       Call Trace:
        [<ffffffffa07c2aae>] ? nft_data_init+0x13e/0x1a0 [nf_tables]
        [<ffffffffa07c1950>] nft_validate_register_store+0x60/0xb0 [nf_tables]
        [<ffffffffa07c74b5>] nft_add_set_elem+0x545/0x5e0 [nf_tables]
        [<ffffffffa07bfdd0>] ? nft_table_lookup+0x30/0x60 [nf_tables]
        [<ffffffff8132c630>] ? nla_strcmp+0x40/0x50
        [<ffffffffa07c766e>] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
        [<ffffffff8132c400>] ? nla_validate+0x60/0x80
        [<ffffffffa030d9b4>] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]
      
      Because we forget to fill the net pointer in bind_ctx, so dereferencing
      it may cause kernel crash.
      Reported-by: NDalegaard <dalegaard@gmail.com>
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      58c78e10
    • F
      netfilter: conntrack: refine gc worker heuristics · e0df8cae
      Florian Westphal 提交于
      Nicolas Dichtel says:
        After commit b87a2f91 ("netfilter: conntrack: add gc worker to
        remove timed-out entries"), netlink conntrack deletion events may be
        sent with a huge delay.
      
      Nicolas further points at this line:
      
        goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
      
      and indeed, this isn't optimal at all.  Rationale here was to ensure that
      we don't block other work items for too long, even if
      nf_conntrack_htable_size is huge.  But in order to have some guarantee
      about maximum time period where a scan of the full conntrack table
      completes we should always use a fixed slice size, so that once every
      N scans the full table has been examined at least once.
      
      We also need to balance this vs. the case where the system is either idle
      (i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
      from packet path).
      
      So, after some discussion with Nicolas:
      
      1. want hard guarantee that we scan entire table at least once every X s
      -> need to scan fraction of table (get rid of upper bound)
      
      2. don't want to eat cycles on idle or very busy system
      -> increase interval if we did not evict any entries
      
      3. don't want to block other worker items for too long
      -> make fraction really small, and prefer small scan interval instead
      
      4. Want reasonable short time where we detect timed-out entry when
      system went idle after a burst of traffic, while not doing scans
      all the time.
      -> Store next gc scan in worker, increasing delays when no eviction
      happened and shrinking delay when we see timed out entries.
      
      The old gc interval is turned into a max number, scans can now happen
      every jiffy if stale entries are present.
      
      Longest possible time period until an entry is evicted is now 2 minutes
      in worst case (entry expires right after it was deemed 'not expired').
      Reported-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      e0df8cae
    • F
      netfilter: conntrack: fix CT target for UNSPEC helpers · 6114cc51
      Florian Westphal 提交于
      Thomas reports its not possible to attach the H.245 helper:
      
      iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
      iptables: No chain/target/match by that name.
      xt_CT: No such helper "H.245"
      
      This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
      passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.
      
      We should treat UNSPEC as wildcard and ignore the l3num instead.
      Reported-by: NThomas Woerner <twoerner@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      6114cc51
    • F
      netfilter: connmark: ignore skbs with magic untracked conntrack objects · fb9c9649
      Florian Westphal 提交于
      The (percpu) untracked conntrack entries can end up with nonzero connmarks.
      
      The 'untracked' conntrack objects are merely a way to distinguish INVALID
      (i.e. protocol connection tracker says payload doesn't meet some
      requirements or packet was never seen by the connection tracking code)
      from packets that are intentionally not tracked (some icmpv6 types such as
      neigh solicitation, or by using 'iptables -j CT --notrack' option).
      
      Untracked conntrack objects are implementation detail, we might as well use
      invalid magic address instead to tell INVALID and UNTRACKED apart.
      
      Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.
      Reported-by: NXU Tianwen <evan.xu.tianwen@gmail.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      fb9c9649
    • W
      ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr · 8fbfef7f
      WANG Cong 提交于
      family.maxattr is the max index for policy[], the size of
      ops[] is determined with ARRAY_SIZE().
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8fbfef7f
  3. 08 11月, 2016 3 次提交
  4. 04 11月, 2016 11 次提交
    • W
      genetlink: fix a memory leak on error path · 00ffc1ba
      WANG Cong 提交于
      In __genl_register_family(), when genl_validate_assign_mc_groups()
      fails, we forget to free the memory we possibly allocate for
      family->attrbuf.
      
      Note, some callers call genl_unregister_family() to clean up
      on error path, it doesn't work because the family is inserted
      to the global list in the nearly last step.
      
      Cc: Jakub Kicinski <kubakici@wp.pl>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00ffc1ba
    • E
      ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped · 990ff4d8
      Eric Dumazet 提交于
      While fuzzing kernel with syzkaller, Andrey reported a nasty crash
      in inet6_bind() caused by DCCP lacking a required method.
      
      Fixes: ab1e0a13 ("[SOCK] proto: Add hashinfo member to struct proto")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      990ff4d8
    • E
      ipv6: dccp: fix out of bound access in dccp_v6_err() · 1aa9d1a0
      Eric Dumazet 提交于
      dccp_v6_err() does not use pskb_may_pull() and might access garbage.
      
      We only need 4 bytes at the beginning of the DCCP header, like TCP,
      so the 8 bytes pulled in icmpv6_notify() are more than enough.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1aa9d1a0
    • E
      netlink: netlink_diag_dump() runs without locks · 93636d1f
      Eric Dumazet 提交于
      A recent commit removed locking from netlink_diag_dump() but forgot
      one error case.
      
      =====================================
      [ BUG: bad unlock balance detected! ]
      4.9.0-rc3+ #336 Not tainted
      -------------------------------------
      syz-executor/4018 is trying to release lock ([   36.220068] nl_table_lock
      ) at:
      [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
      but there are no more locks to release!
      
      other info that might help us debug this:
      3 locks held by syz-executor/4018:
       #0: [   36.220068]  (
      sock_diag_mutex[   36.220068] ){+.+.+.}
      , at: [   36.220068] [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
       #1: [   36.220068]  (
      sock_diag_table_mutex[   36.220068] ){+.+.+.}
      , at: [   36.220068] [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
       #2: [   36.220068]  (
      nlk->cb_mutex[   36.220068] ){+.+.+.}
      , at: [   36.220068] [<ffffffff82db6600>] netlink_dump+0x50/0xac0
      
      stack backtrace:
      CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
       ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
       dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
       [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0
      kernel/locking/lockdep.c:3388
       [<     inline     >] __lock_release kernel/locking/lockdep.c:3512
       [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
       [<     inline     >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
       [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
       [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
       [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110
      
      Fixes: ad202074 ("netlink: Use rhashtable walk interface in diag dump")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93636d1f
    • E
      dccp: fix out of bound access in dccp_v4_err() · 6706a97f
      Eric Dumazet 提交于
      dccp_v4_err() does not use pskb_may_pull() and might access garbage.
      
      We only need 4 bytes at the beginning of the DCCP header, like TCP,
      so the 8 bytes pulled in icmp_socket_deliver() are more than enough.
      
      This patch might allow to process more ICMP messages, as some routers
      are still limiting the size of reflected bytes to 28 (RFC 792), instead
      of extended lengths (RFC 1812 4.3.2.3)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6706a97f
    • E
      dccp: do not send reset to already closed sockets · 346da62c
      Eric Dumazet 提交于
      Andrey reported following warning while fuzzing with syzkaller
      
      WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000
       ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a
       0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
       [<ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179
       [<ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542
       [<ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
       [<ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
       [<ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
       [<ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
       [<ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570
       [<ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017
       [<ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208
       [<ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244
       [<ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116
       [<     inline     >] exit_task_work include/linux/task_work.h:21
       [<ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828
       [<ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931
       [<ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307
       [<ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
       [<ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130
      arch/x86/entry/common.c:156
       [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
       [<ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0
      arch/x86/entry/common.c:259
       [<ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      
      Fix this the same way we did for TCP in commit 565b7b2d
      ("tcp: do not send reset to already closed sockets")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      346da62c
    • E
      dccp: do not release listeners too soon · c3f24cfb
      Eric Dumazet 提交于
      Andrey Konovalov reported following error while fuzzing with syzkaller :
      
      IPv4: Attempt to release alive inet socket ffff880068e98940
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff88006b9e0000 task.stack: ffff880068770000
      RIP: 0010:[<ffffffff819ead5f>]  [<ffffffff819ead5f>]
      selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
      RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
      RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
      RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
      RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
      R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
      R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
      FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
      Stack:
       ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
       ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
       ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
      Call Trace:
       [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
       [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
       [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
       [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
       [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
      net/ipv4/ip_input.c:216
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
       [<     inline     >] dst_input ./include/net/dst.h:507
       [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
       [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
       [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
       [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
       [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
       [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
       [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
       [<     inline     >] new_sync_write fs/read_write.c:499
       [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
       [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
       [<     inline     >] SYSC_write fs/read_write.c:607
       [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
       [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      It turns out DCCP calls __sk_receive_skb(), and this broke when
      lookups no longer took a reference on listeners.
      
      Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
      so that sock_put() is used only when needed.
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3f24cfb
    • E
      tcp: fix return value for partial writes · 79d8665b
      Eric Dumazet 提交于
      After my commit, tcp_sendmsg() might restart its loop after
      processing socket backlog.
      
      If sk_err is set, we blindly return an error, even though we
      copied data to user space before.
      
      We should instead return number of bytes that could be copied,
      otherwise user space might resend data and corrupt the stream.
      
      This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
      to process timestamps.
      
      Issue was diagnosed by Soheil and Willem, big kudos to them !
      
      Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79d8665b
    • L
      ipv4: allow local fragmentation in ip_finish_output_gso() · 9ee6c5dc
      Lance Richardson 提交于
      Some configurations (e.g. geneve interface with default
      MTU of 1500 over an ethernet interface with 1500 MTU) result
      in the transmission of packets that exceed the configured MTU.
      While this should be considered to be a "bad" configuration,
      it is still allowed and should not result in the sending
      of packets that exceed the configured MTU.
      
      Fix by dropping the assumption in ip_finish_output_gso() that
      locally originated gso packets will never need fragmentation.
      Basic testing using iperf (observing CPU usage and bandwidth)
      have shown no measurable performance impact for traffic not
      requiring fragmentation.
      
      Fixes: c7ba65d7 ("net: ip: push gso skb forwarding handling down the stack")
      Reported-by: NJan Tluka <jtluka@redhat.com>
      Signed-off-by: NLance Richardson <lrichard@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ee6c5dc
    • E
      tcp: fix potential memory corruption · ac9e70b1
      Eric Dumazet 提交于
      Imagine initial value of max_skb_frags is 17, and last
      skb in write queue has 15 frags.
      
      Then max_skb_frags is lowered to 14 or smaller value.
      
      tcp_sendmsg() will then be allowed to add additional page frags
      and eventually go past MAX_SKB_FRAGS, overflowing struct
      skb_shared_info.
      
      Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Cc: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac9e70b1
    • W
      inet: fix sleeping inside inet_wait_for_connect() · 14135f30
      WANG Cong 提交于
      Andrey reported this kernel warning:
      
        WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
        __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
        do not call blocking ops when !TASK_RUNNING; state=1 set at
        [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
        kernel/sched/wait.c:178
        Modules linked in:
        CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
         ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
         ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
         ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
        Call Trace:
         [<     inline     >] __dump_stack lib/dump_stack.c:15
         [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
         [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
         [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
         [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
         [<     inline     >] slab_pre_alloc_hook mm/slab.h:393
         [<     inline     >] slab_alloc_node mm/slub.c:2634
         [<     inline     >] slab_alloc mm/slub.c:2716
         [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
         [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
         [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
         [<     inline     >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
         [<     inline     >] dccp_feat_change_recv net/dccp/feat.c:1141
         [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
         [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
         [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
         [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
         [<     inline     >] sk_backlog_rcv ./include/net/sock.h:872
         [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
         [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
         [<     inline     >] inet_wait_for_connect net/ipv4/af_inet.c:547
         [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
         [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
         [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
         [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
         [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
        arch/x86/entry/entry_64.S:209
      
      Unlike commit 26cabd31
      ("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
      sleeping function is called before schedule_timeout(), this is indeed
      a bug. Fix this by moving the wait logic to the new API, it is similar
      to commit ff960a73
      ("netdev, sched/wait: Fix sleeping inside wait event").
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14135f30
  5. 03 11月, 2016 1 次提交
    • E
      ip6_udp_tunnel: remove unused IPCB related codes · 4fd19c15
      Eli Cooper 提交于
      Some IPCB fields are currently set in udp_tunnel6_xmit_skb(), which are
      never used before it reaches ip6tunnel_xmit(), and past that point the
      control buffer is no longer interpreted as IPCB.
      
      This clears these unused IPCB related codes. Currently there is no skb
      scrubbing in ip6_udp_tunnel, otherwise IPCB(skb)->opt might need to be
      cleared for IPv4 packets, as shown in 5146d1f1
      ("tunnel: Clear IPCB(skb)->opt before dst_link_failure called").
      Signed-off-by: NEli Cooper <elicooper@gmx.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4fd19c15
  6. 02 11月, 2016 1 次提交
  7. 01 11月, 2016 9 次提交
  8. 31 10月, 2016 2 次提交
  9. 30 10月, 2016 3 次提交
    • J
      tipc: fix broadcast link synchronization problem · 06bd2b1e
      Jon Paul Maloy 提交于
      In commit 2d18ac4b ("tipc: extend broadcast link initialization
      criteria") we tried to fix a problem with the initial synchronization
      of broadcast link acknowledge values. Unfortunately that solution is
      not sufficient to solve the issue.
      
      We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
      non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
      initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
      broadcast acknowledge numbers, leading to premature opening of the
      broadcast link. When the bypassed packets finally arrive, they are
      inadvertently accepted, and the already correctly initialized
      acknowledge number in the broadcast receive link is overwritten by
      the invalid (zero) value of the said packets. After this the broadcast
      link goes stale.
      
      We now fix this by marking the packets where we know the acknowledge
      value is or may be invalid, and then ignoring the acks from those.
      
      To this purpose, we claim an unused bit in the header to indicate that
      the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
      synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
      packets, plus those LINK_PROTOCOL packets sent out before the broadcast
      links are fully synchronized.
      
      This minor protocol update is fully backwards compatible.
      Reported-by: NJohn Thompson <thompa.atl@gmail.com>
      Tested-by: NJohn Thompson <thompa.atl@gmail.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06bd2b1e
    • S
      rds: debug messages are enabled by default · ff57087f
      shamir rabinovitch 提交于
      rds use Kconfig option called "RDS_DEBUG" to enable rds debug messages.
      This option cause the rds Makefile to add -DDEBUG to the rds gcc command
      line.
      
      When CONFIG_DYNAMIC_DEBUG is enabled, the "DEBUG" macro is used by
      include/linux/dynamic_debug.h to decide if dynamic debug prints should
      be sent by default to the kernel log.
      
      rds should not enable this macro for production builds. rds dynamic
      debug work as expected follow this fix.
      Signed-off-by: NShamir Rabinovitch <shamir.rabinovitch@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Reviewed-by: NWengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff57087f
    • W
      packet: on direct_xmit, limit tso and csum to supported devices · 104ba78c
      Willem de Bruijn 提交于
      When transmitting on a packet socket with PACKET_VNET_HDR and
      PACKET_QDISC_BYPASS, validate device support for features requested
      in vnet_hdr.
      
      Drop TSO packets sent to devices that do not support TSO or have the
      feature disabled. Note that the latter currently do process those
      packets correctly, regardless of not advertising the feature.
      
      Because of SKB_GSO_DODGY, it is not sufficient to test device features
      with netif_needs_gso. Full validate_xmit_skb is needed.
      
      Switch to software checksum for non-TSO packets that request checksum
      offload if that device feature is unsupported or disabled. Note that
      similar to the TSO case, device drivers may perform checksum offload
      correctly even when not advertising it.
      
      When switching to software checksum, packets hit skb_checksum_help,
      which has two BUG_ON checksum not in linear segment. Packet sockets
      always allocate at least up to csum_start + csum_off + 2 as linear.
      
      Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c
      
        ethtool -K eth0 tso off tx on
        psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
        psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N
      
        ethtool -K eth0 tx off
        psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
        psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N
      
      v2:
        - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)
      
      Fixes: d346a3fa ("packet: introduce PACKET_QDISC_BYPASS socket option")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      104ba78c