1. 19 Nov, 2016 (1 commit)
    • af_unix: conditionally use freezable blocking calls in read · 06a77b07
      Committed by WANG Cong
      Commit 2b15af6f ("af_unix: use freezable blocking calls in read")
      converted schedule_timeout() to its freezable version. That was probably
      correct at the time, but commit 2b514574
      ("net: af_unix: implement splice for stream af_unix sockets") later broke
      the strong requirement for a freezable sleep, as stated in
      commit 0f9548ca:
      
          We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
          deadlock if the lock is later acquired in the suspend or hibernate path
          (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
          cgroup_freezer if a lock is held inside a frozen cgroup that is later
          acquired by a process outside that group.
      
      The pipe_lock is still held at that point.
      
      So use the freezable version only for the recvmsg call path, to avoid
      any impact on Android.
      
      Fixes: 2b514574 ("net: af_unix: implement splice for stream af_unix sockets")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
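
The choice described in this commit can be sketched as a small model. This is only an illustration, not the kernel code: `wait_for_data` and `stream_read` are hypothetical names, and the strings merely stand for the kernel's `schedule_timeout()` and its freezable variant `freezable_schedule_timeout()`.

```python
def wait_for_data(freezable):
    """Pick the sleep primitive for a blocked stream read (model only)."""
    # A freezable sleep is only safe when the caller holds no locks.
    if freezable:
        return "freezable_schedule_timeout"
    return "schedule_timeout"


def stream_read(path):
    # Assumption for this sketch: the "splice" path sleeps with pipe_lock
    # held, while the "recvmsg" path sleeps without it.
    holds_pipe_lock = (path == "splice")
    return wait_for_data(freezable=not holds_pipe_lock)
```

In this model only the recvmsg path keeps the freezable sleep; the splice path falls back to the plain one because a lock is still held.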
  2. 18 Nov, 2016 (4 commits)
  3. 17 Nov, 2016 (2 commits)
  4. 16 Nov, 2016 (4 commits)
  5. 15 Nov, 2016 (7 commits)
  6. 14 Nov, 2016 (2 commits)
    • tcp: take care of truncations done by sk_filter() · ac6e7800
      Committed by Eric Dumazet
      With syzkaller's help, Marco Grassi found a bug in the TCP stack,
      crashing in tcp_collapse().
      
      The root cause is that sk_filter() can truncate the incoming skb,
      but the TCP stack was not expecting this to happen.
      It probably expected a simple DROP or ACCEPT behavior.
      
      We first need to make sure no part of the TCP header can be removed.
      Then we need to adjust TCP_SKB_CB(skb)->end_seq.
      
      Many thanks to syzkaller team and Marco for giving us a reproducer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Marco Grassi <marco.gra@gmail.com>
      Reported-by: Vladis Dronov <vdronov@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
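
The two steps named in this commit (never trim into the TCP header, then adjust end_seq) can be modelled as follows. This is a minimal sketch under stated assumptions, not the kernel implementation: `Segment` and `apply_filter` are hypothetical names, the filter is assumed to return the number of bytes to keep (0 meaning drop), and SYN/FIN sequence accounting is ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    header_len: int  # bytes of TCP header
    data: bytes      # header plus payload, as seen by the filter
    seq: int
    end_seq: int     # seq + payload length in this simplified model


def apply_filter(seg, keep_len):
    """Apply a truncating filter verdict to a segment (model only)."""
    if keep_len == 0:
        return None  # plain DROP
    # Step 1: the filter may not cut into the TCP header.
    keep = max(keep_len, seg.header_len)
    keep = min(keep, len(seg.data))
    trimmed = len(seg.data) - keep
    seg.data = seg.data[:keep]
    # Step 2: end_seq must shrink by the payload bytes that were removed.
    seg.end_seq = (seg.end_seq - trimmed) & 0xFFFFFFFF
    return seg
```

Without the second step, end_seq would still describe payload bytes that no longer exist in the buffer, which is the kind of inconsistency the commit message attributes the crash to.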
    • ipv4: use new_gw for redirect neigh lookup · 969447f2
      Committed by Stephen Suryaputra Lin
      In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
      and then the state of the neigh for the new_gw is checked. If the state
      isn't valid then the redirected route is deleted. This behavior is
      maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
      is assigned to peer->redirect_learned.a4 before calling
      ipv4_neigh_lookup().
      
      After commit 5943634f ("ipv4: Maintain redirect and PMTU info in
      struct rtable again."), ipv4_neigh_lookup() is performed without the
      rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
      isn't zero, the function uses it as the key. The neigh is most likely
      valid since the old_gw is the one that sends the ICMP redirect message.
      Then the new_gw is assigned to fib_nh_exception. The problem is that the
      ARP entry for new_gw may never get resolved, and the traffic is blackholed.
      
      So, use the new_gw for neigh lookup.
      
      Changes from v1:
       - use __ipv4_neigh_lookup instead (per Eric Dumazet).
      
      Fixes: 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
      Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 13 Nov, 2016 (2 commits)
  8. 11 Nov, 2016 (4 commits)
  9. 10 Nov, 2016 (5 commits)
    • net: tcp response should set oif only if it is L3 master · 9b6c14d5
      Committed by David Ahern
      Lorenzo noted an Android unit test failed due to e0d56fdd:
      "The expectation in the test was that the RST replying to a SYN sent to a
      closed port should be generated with oif=0. In other words it should not
      prefer the interface where the SYN came in on, but instead should follow
      whatever the routing table says it should do."
      
      Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
      that the oif in the flow is set to the skb_iif only if skb_iif is an L3
      master.
      
      Fixes: e0d56fdd ("net: l3mdev: remove redundant calls")
      Reported-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
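
The rule restored by this commit fits in one line of logic. The sketch below is illustrative only: `reply_oif` and the set of L3 master interface indexes are hypothetical stand-ins for the kernel's own check.

```python
def reply_oif(skb_iif, l3_master_ifindexes):
    """Return the oif to put in the reply's flow; 0 means 'let routing decide'."""
    # Only pin the reply to the incoming interface when it is an L3 master.
    return skb_iif if skb_iif in l3_master_ifindexes else 0
```

For example, with interface 7 assumed to be an L3 master device, `reply_oif(7, {7})` keeps the interface, while `reply_oif(3, {7})` yields 0 so that the RST follows the routing table, as the quoted test expects.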
    • rtnl: reset calcit fptr in rtnl_unregister() · f567e950
      Committed by Mathias Krause
      To avoid having dangling function pointers left behind, reset calcit in
      rtnl_unregister(), too.
      
      This is no issue so far, as only the rtnl core registers a netlink
      handler with a calcit hook which won't be unregistered, but may become
      one if new code makes use of the calcit hook.
      
      Fixes: c7ac8679 ("rtnetlink: Compute and store minimum ifinfo...")
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
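
A small model of why every hook has to be cleared on unregister. This is a sketch, not the rtnetlink code: `HandlerTable` and its methods are hypothetical, and only the hook names doit, dumpit and calcit come from the kernel.

```python
class HandlerTable:
    """Toy registry of per-(family, msgtype) handler hooks."""

    HOOKS = ("doit", "dumpit", "calcit")

    def __init__(self):
        self.slots = {}

    def register(self, family, msgtype, doit=None, dumpit=None, calcit=None):
        self.slots[(family, msgtype)] = {
            "doit": doit, "dumpit": dumpit, "calcit": calcit,
        }

    def unregister(self, family, msgtype):
        slot = self.slots.get((family, msgtype))
        if slot is None:
            return False
        # Clear all hooks, including calcit, so no stale pointer survives.
        for hook in self.HOOKS:
            slot[hook] = None
        return True
```

If `unregister` cleared only doit and dumpit, a later lookup could still find and call a calcit hook belonging to code that is already gone.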
    • net: icmp_route_lookup should use rt dev to determine L3 domain · 9d1a6c4e
      Committed by David Ahern
      icmp_send is called in response to some event. The skb may not have
      the device set (skb->dev is NULL), but it is expected to have an rt.
      Update icmp_route_lookup to use the rt on the skb to determine L3
      domain.
      
      Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-ipv6: on device mtu change do not add mtu to mtu-less routes · fb56be83
      Committed by Maciej Żenczykowski
      Routes can specify an mtu explicitly or inherit the mtu from
      the underlying device - this inheritance is implemented in
      dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().
      
      Currently changing the mtu of a device adds mtu explicitly
      to routes using that device.
      
      For example:
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65535 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
      
      After this patch the route entry no longer changes unless it already has an mtu.
      There is no need to add one: the inheritance is already done in ip6_mtu().
      
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route add local 2000::2 dev lo mtu 2000
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 1501
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 1501 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
        # ip -6 route del local 2000::2
      
      This is desirable because changing device mtu and then resetting it
      to the previous value shouldn't change the user visible routing table.
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      CC: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
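
The behavior shown in the second transcript can be reproduced with a small model. This is a sketch under assumptions, not the kernel function: `Route`, `effective_mtu` and `on_device_mtu_change` are hypothetical names, and the update rule for routes that do carry an explicit mtu is inferred from the transcript (follow the device when the route mtu would exceed the new device mtu, or when it was equal to the old device mtu).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    dst: str
    mtu: Optional[int] = None  # None: no explicit mtu, inherit from the device


def effective_mtu(route, dev_mtu):
    """What a lookup would report: explicit mtu if set, else the device mtu."""
    return route.mtu if route.mtu is not None else dev_mtu


def on_device_mtu_change(routes, old_dev_mtu, new_dev_mtu):
    for route in routes:
        if route.mtu is None:
            continue  # mtu-less routes are left untouched
        # Inferred rule for routes with an explicit mtu (assumption).
        if route.mtu >= new_dev_mtu or route.mtu == old_dev_mtu:
            route.mtu = new_dev_mtu
```

Replaying the device mtu sequence 65536, 65535, 1501, 65536 against a route with `mtu 2000` gives 2000, 1501 and 65536, matching the transcript, while the mtu-less route never gains an mtu field.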
    • sock: fix sendmmsg for partial sendmsg · 3023898b
      Committed by Soheil Hassas Yeganeh
      Do not send the next message in sendmmsg for partial sendmsg
      invocations.
      
      sendmmsg assumes that it can continue sending the next message
      when the return value of the individual sendmsg invocations
      is positive. It results in corrupting the data for TCP,
      SCTP, and UNIX streams.
      
      For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
      of "aefgh" if the first sendmsg invocation sends only the first
      byte while the second sendmsg goes through.
      
      Datagram sockets either send the entire datagram or fail, so
      this patch affects only sockets of type SOCK_STREAM and
      SOCK_SEQPACKET.
      
      Fixes: 228e548e ("net: Add sendmmsg socket system call")
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
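
The "abcd"/"efgh" example from this commit can be replayed with a toy stream. This is a minimal sketch, not the syscall implementation: `FakeStream` and this Python `sendmmsg` are hypothetical stand-ins that only model the loop's stopping rule.

```python
class FakeStream:
    """Byte stream whose send() may accept fewer bytes than requested."""

    def __init__(self, limits=()):
        self.limits = list(limits)  # max bytes accepted per call, in order
        self.data = bytearray()

    def send(self, msg):
        limit = self.limits.pop(0) if self.limits else len(msg)
        n = min(limit, len(msg))
        self.data += msg[:n]
        return n


def sendmmsg(send_one, messages):
    """Send messages in order; stop after an error or a partial send."""
    sent = []
    for msg in messages:
        n = send_one(msg)
        if n < 0:
            break
        sent.append(n)
        if n < len(msg):
            break  # partial send: the next message must not follow
    return sent
```

With `FakeStream([1])`, the first message is cut to one byte and the loop stops, leaving the stream as "a" rather than "aefgh"; the caller sees from the returned lengths that only part of the first message went out.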
  10. 09 Nov, 2016 (5 commits)
    • netfilter: nf_tables: fix oops when inserting an element into a verdict map · 58c78e10
      Committed by Liping Zhang
      Dalegaard says:
       The following ruleset, when loaded with 'nft -f bad.txt'
       ----snip----
       flush ruleset
       table ip inlinenat {
         map sourcemap {
           type ipv4_addr : verdict;
         }
      
         chain postrouting {
           ip saddr vmap @sourcemap accept
         }
       }
       add chain inlinenat test
       add element inlinenat sourcemap { 100.123.10.2 : jump test }
       ----snip----
      
       results in a kernel oops:
       BUG: unable to handle kernel paging request at 0000000000001344
       IP: [<ffffffffa07bf704>] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
       [...]
       Call Trace:
        [<ffffffffa07c2aae>] ? nft_data_init+0x13e/0x1a0 [nf_tables]
        [<ffffffffa07c1950>] nft_validate_register_store+0x60/0xb0 [nf_tables]
        [<ffffffffa07c74b5>] nft_add_set_elem+0x545/0x5e0 [nf_tables]
        [<ffffffffa07bfdd0>] ? nft_table_lookup+0x30/0x60 [nf_tables]
        [<ffffffff8132c630>] ? nla_strcmp+0x40/0x50
        [<ffffffffa07c766e>] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
        [<ffffffff8132c400>] ? nla_validate+0x60/0x80
        [<ffffffffa030d9b4>] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]
      
      This happens because we forgot to fill in the net pointer in bind_ctx,
      so dereferencing it may crash the kernel.
      Reported-by: Dalegaard <dalegaard@gmail.com>
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: refine gc worker heuristics · e0df8cae
      Committed by Florian Westphal
      Nicolas Dichtel says:
        After commit b87a2f91 ("netfilter: conntrack: add gc worker to
        remove timed-out entries"), netlink conntrack deletion events may be
        sent with a huge delay.
      
      Nicolas further points at this line:
      
        goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
      
      and indeed, this isn't optimal at all.  Rationale here was to ensure that
      we don't block other work items for too long, even if
      nf_conntrack_htable_size is huge.  But in order to have some guarantee
      about maximum time period where a scan of the full conntrack table
      completes we should always use a fixed slice size, so that once every
      N scans the full table has been examined at least once.
      
      We also need to balance this vs. the case where the system is either idle
      (i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
      from packet path).
      
      So, after some discussion with Nicolas:
      
      1. want hard guarantee that we scan entire table at least once every X s
      -> need to scan fraction of table (get rid of upper bound)
      
      2. don't want to eat cycles on idle or very busy system
      -> increase interval if we did not evict any entries
      
      3. don't want to block other worker items for too long
      -> make fraction really small, and prefer small scan interval instead
      
      4. Want reasonable short time where we detect timed-out entry when
      system went idle after a burst of traffic, while not doing scans
      all the time.
      -> Store next gc scan in worker, increasing delays when no eviction
      happened and shrinking delay when we see timed out entries.
      
      The old gc interval is turned into a maximum; scans can now happen
      every jiffy if stale entries are present.
      
      Longest possible time period until an entry is evicted is now 2 minutes
      in worst case (entry expires right after it was deemed 'not expired').
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
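
The heuristics listed in this commit (scan a fixed fraction of the table per run, back off when nothing is evicted, speed up when stale entries are found) can be sketched as below. All names and constants are hypothetical and chosen for illustration; the kernel's actual values and units are not reproduced here.

```python
GC_MIN_DELAY = 1     # assumed smallest delay between scans (arbitrary unit)
GC_MAX_DELAY = 1000  # assumed upper bound, playing the role of the old interval


def scan_slice(table_size, cursor, divisor):
    """Return the buckets to scan in this run and the next cursor position."""
    count = max(table_size // divisor, 1)  # fixed fraction, no upper bound
    buckets = [(cursor + i) % table_size for i in range(count)]
    return buckets, (cursor + count) % table_size


def next_gc_delay(current_delay, evicted):
    """Shrink the delay after evictions, grow it after an empty scan."""
    if evicted > 0:
        return max(current_delay // 2, GC_MIN_DELAY)
    return min(max(current_delay, GC_MIN_DELAY) * 2, GC_MAX_DELAY)
```

Because each run covers a fixed fraction of the table, a table of 10 buckets with a divisor of 5 is fully visited after 5 runs, which is what gives a bound on how long a timed-out entry can linger.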
    • netfilter: conntrack: fix CT target for UNSPEC helpers · 6114cc51
      Committed by Florian Westphal
      Thomas reports it's not possible to attach the H.245 helper:
      
      iptables -t raw -A PREROUTING -p udp -j CT --helper H.245
      iptables: No chain/target/match by that name.
      xt_CT: No such helper "H.245"
      
      This is because H.245 registers as NFPROTO_UNSPEC, but the CT target
      passes NFPROTO_IPV4/IPV6 to nf_conntrack_helper_try_module_get.
      
      We should treat UNSPEC as wildcard and ignore the l3num instead.
      Reported-by: Thomas Woerner <twoerner@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
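
Treating UNSPEC as a wildcard in the helper lookup can be sketched like this. It is only an illustration: `find_helper` and the dictionary layout are hypothetical, while the NFPROTO_* constants carry the values from the kernel's uapi headers.

```python
NFPROTO_UNSPEC = 0
NFPROTO_IPV4 = 2
NFPROTO_IPV6 = 10
IPPROTO_UDP = 17


def find_helper(helpers, name, l3num, protonum):
    """Find a helper by name and L4 protocol; UNSPEC matches any l3num."""
    for helper in helpers:
        if helper["name"] != name or helper["protonum"] != protonum:
            continue
        if helper["l3num"] in (NFPROTO_UNSPEC, l3num):
            return helper
    return None
```

A helper registered with `l3num` set to NFPROTO_UNSPEC is then found whether the caller asks for NFPROTO_IPV4 or NFPROTO_IPV6, which is the case the H.245 report ran into.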
    • netfilter: connmark: ignore skbs with magic untracked conntrack objects · fb9c9649
      Committed by Florian Westphal
      The (percpu) untracked conntrack entries can end up with nonzero connmarks.
      
      The 'untracked' conntrack objects are merely a way to distinguish INVALID
      (i.e. protocol connection tracker says payload doesn't meet some
      requirements or packet was never seen by the connection tracking code)
      from packets that are intentionally not tracked (some icmpv6 types such as
      neigh solicitation, or by using 'iptables -j CT --notrack' option).
      
      Untracked conntrack objects are an implementation detail; we might as well
      use an invalid magic address instead to tell INVALID and UNTRACKED apart.
      
      Check skb->nfct for untracked dummy and behave as if skb->nfct is NULL.
      Reported-by: XU Tianwen <evan.xu.tianwen@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr · 8fbfef7f
      Committed by WANG Cong
      family.maxattr is the max index for policy[], the size of
      ops[] is determined with ARRAY_SIZE().
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  11. 08 Nov, 2016 (4 commits)