1. 13 9月, 2017 1 次提交
  2. 25 8月, 2017 1 次提交
  3. 17 8月, 2017 1 次提交
    • L
      openvswitch: fix skb_panic due to the incorrect actions attrlen · 494bea39
      Liping Zhang 提交于
      For sw_flow_actions, the actions_len only represents the kernel part's
      size, and when we dump the actions to the userspace, we will do the
      convertions, so it's true size may become bigger than the actions_len.
      
      But unfortunately, for OVS_PACKET_ATTR_ACTIONS, we use the actions_len
      to alloc the skbuff, so the user_skb's size may become insufficient and
      oops will happen like this:
        skbuff: skb_over_panic: text:ffffffff8148fabf len:1749 put:157 head:
        ffff881300f39000 data:ffff881300f39000 tail:0x6d5 end:0x6c0 dev:<NULL>
        ------------[ cut here ]------------
        kernel BUG at net/core/skbuff.c:129!
        [...]
        Call Trace:
         <IRQ>
         [<ffffffff8148be82>] skb_put+0x43/0x44
         [<ffffffff8148fabf>] skb_zerocopy+0x6c/0x1f4
         [<ffffffffa0290d36>] queue_userspace_packet+0x3a3/0x448 [openvswitch]
         [<ffffffffa0292023>] ovs_dp_upcall+0x30/0x5c [openvswitch]
         [<ffffffffa028d435>] output_userspace+0x132/0x158 [openvswitch]
         [<ffffffffa01e6890>] ? ip6_rcv_finish+0x74/0x77 [ipv6]
         [<ffffffffa028e277>] do_execute_actions+0xcc1/0xdc8 [openvswitch]
         [<ffffffffa028e3f2>] ovs_execute_actions+0x74/0x106 [openvswitch]
         [<ffffffffa0292130>] ovs_dp_process_packet+0xe1/0xfd [openvswitch]
         [<ffffffffa0292b77>] ? key_extract+0x63c/0x8d5 [openvswitch]
         [<ffffffffa029848b>] ovs_vport_receive+0xa1/0xc3 [openvswitch]
        [...]
      
      Also we can find that the actions_len is much little than the orig_len:
        crash> struct sw_flow_actions 0xffff8812f539d000
        struct sw_flow_actions {
          rcu = {
            next = 0xffff8812f5398800,
            func = 0xffffe3b00035db32
          },
          orig_len = 1384,
          actions_len = 592,
          actions = 0xffff8812f539d01c
        }
      
      So as a quick fix, use the orig_len instead of the actions_len to alloc
      the user_skb.
      
      Last, this oops happened on our system running a relative old kernel, but
      the same risk still exists on the mainline, since we use the wrong
      actions_len from the beginning.
      
      Fixes: ccea7445 ("openvswitch: include datapath actions with sampled-packet upcall to userspace")
      Cc: Neil McKee <neil.mckee@inmon.com>
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      494bea39
  4. 12 8月, 2017 1 次提交
  5. 25 7月, 2017 1 次提交
  6. 20 7月, 2017 2 次提交
    • T
      openvswitch: Optimize operations for OvS flow_stats. · c4b2bf6b
      Tonghao Zhang 提交于
      When calling the flow_free() to free the flow, we call many times
      (cpu_possible_mask, eg. 128 as default) cpumask_next(). That will
      take up our CPU usage if we call the flow_free() frequently.
      When we put all packets to userspace via upcall, and OvS will send
      them back via netlink to ovs_packet_cmd_execute(will call flow_free).
      
      The test topo is shown as below. VM01 sends TCP packets to VM02,
      and OvS forward packtets. When testing, we use perf to report the
      system performance.
      
      VM01 --- OvS-VM --- VM02
      
      Without this patch, perf-top show as below: The flow_free() is
      3.02% CPU usage.
      
      	4.23%  [kernel]            [k] _raw_spin_unlock_irqrestore
      	3.62%  [kernel]            [k] __do_softirq
      	3.16%  [kernel]            [k] __memcpy
      	3.02%  [kernel]            [k] flow_free
      	2.42%  libc-2.17.so        [.] __memcpy_ssse3_back
      	2.18%  [kernel]            [k] copy_user_generic_unrolled
      	2.17%  [kernel]            [k] find_next_bit
      
      When applied this patch, perf-top show as below: Not shown on
      the list anymore.
      
      	4.11%  [kernel]            [k] _raw_spin_unlock_irqrestore
      	3.79%  [kernel]            [k] __do_softirq
      	3.46%  [kernel]            [k] __memcpy
      	2.73%  libc-2.17.so        [.] __memcpy_ssse3_back
      	2.25%  [kernel]            [k] copy_user_generic_unrolled
      	1.89%  libc-2.17.so        [.] _int_malloc
      	1.53%  ovs-vswitchd        [.] xlate_actions
      
      With this patch, the TCP throughput(we dont use Megaflow Cache
      + Microflow Cache) between VMs is 1.18Gbs/sec up to 1.30Gbs/sec
      (maybe ~10% performance imporve).
      
      This patch adds cpumask struct, the cpu_used_mask stores the cpu_id
      that the flow used. And we only check the flow_stats on the cpu we
      used, and it is unncessary to check all possible cpu when getting,
      cleaning, and updating the flow_stats. Adding the cpu_used_mask to
      sw_flow struct does’t increase the cacheline number.
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4b2bf6b
    • T
      openvswitch: Optimize updating for OvS flow_stats. · c57c054e
      Tonghao Zhang 提交于
      In the ovs_flow_stats_update(), we only use the node
      var to alloc flow_stats struct. But this is not a
      common case, it is unnecessary to call the numa_node_id()
      everytime. This patch is not a bugfix, but there maybe
      a small increase.
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c57c054e
  7. 18 7月, 2017 1 次提交
  8. 16 7月, 2017 1 次提交
    • G
      openvswitch: Fix for force/commit action failures · 8b97ac5b
      Greg Rose 提交于
      When there is an established connection in direction A->B, it is
      possible to receive a packet on port B which then executes
      ct(commit,force) without first performing ct() - ie, a lookup.
      In this case, we would expect that this packet can delete the existing
      entry so that we can commit a connection with direction B->A. However,
      currently we only perform a check in skb_nfct_cached() for whether
      OVS_CS_F_TRACKED is set and OVS_CS_F_INVALID is not set, ie that a
      lookup previously occurred. In the above scenario, a lookup has not
      occurred but we should still be able to statelessly look up the
      existing entry and potentially delete the entry if it is in the
      opposite direction.
      
      This patch extends the check to also hint that if the action has the
      force flag set, then we will lookup the existing entry so that the
      force check at the end of skb_nfct_cached has the ability to delete
      the connection.
      
      Fixes: dd41d330b03 ("openvswitch: Add force commit.")
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: dev@openvswitch.org
      Signed-off-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NGreg Rose <gvrose8192@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b97ac5b
  9. 03 7月, 2017 1 次提交
  10. 02 7月, 2017 1 次提交
  11. 25 6月, 2017 1 次提交
    • J
      net: store port/representator id in metadata_dst · 3fcece12
      Jakub Kicinski 提交于
      Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
      representators and control messages over single set of hardware queues.
      Control messages and muxed traffic may need ordered delivery.
      
      Those requirements make it hard to comfortably use TC infrastructure today
      unless we have a way of attaching metadata to skbs at the upper device.
      Because single set of queues is used for many netdevs stopping TC/sched
      queues of all of them reliably is impossible and lower device has to
      retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
      the fastpath.
      
      This patch attempts to enable port/representative devs to attach metadata
      to skbs which carry port id.  This way representatives can be queueless and
      all queuing can be performed at the lower netdev in the usual way.
      
      Traffic arriving on the port/representative interfaces will be have
      metadata attached and will subsequently be queued to the lower device for
      transmission.  The lower device should recognize the metadata and translate
      it to HW specific format which is most likely either a special header
      inserted before the network headers or descriptor/metadata fields.
      
      Metadata is associated with the lower device by storing the netdev pointer
      along with port id so that if TC decides to redirect or mirror the new
      netdev will not try to interpret it.
      
      This is mostly for SR-IOV devices since switches don't have lower netdevs
      today.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3fcece12
  12. 21 6月, 2017 1 次提交
  13. 16 6月, 2017 1 次提交
    • J
      networking: convert many more places to skb_put_zero() · b080db58
      Johannes Berg 提交于
      There were many places that my previous spatch didn't find,
      as pointed out by yuan linyu in various patches.
      
      The following spatch found many more and also removes the
      now unnecessary casts:
      
          @@
          identifier p, p2;
          expression len;
          expression skb;
          type t, t2;
          @@
          (
          -p = skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          |
          -p = (t)skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, len);
          |
          -memset(p, 0, len);
          )
      
          @@
          type t, t2;
          identifier p, p2;
          expression skb;
          @@
          t *p;
          ...
          (
          -p = skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          |
          -p = (t *)skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, sizeof(*p));
          |
          -memset(p, 0, sizeof(*p));
          )
      
          @@
          expression skb, len;
          @@
          -memset(skb_put(skb, len), 0, len);
          +skb_put_zero(skb, len);
      
      Apply it to the tree (with one manual fixup to keep the
      comment in vxlan.c, which spatch removed.)
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b080db58
  14. 08 6月, 2017 1 次提交
    • D
      net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      David S. Miller 提交于
      Network devices can allocate reasources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally does also a free_netdev().
      
      This creates a problem for the logic in register_netdevice().
      Because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf124db5
  15. 20 5月, 2017 1 次提交
  16. 15 5月, 2017 1 次提交
  17. 25 4月, 2017 3 次提交
    • J
      openvswitch: Delete conntrack entry clashing with an expectation. · cf5d7091
      Jarno Rajahalme 提交于
      Conntrack helpers do not check for a potentially clashing conntrack
      entry when creating a new expectation.  Also, nf_conntrack_in() will
      check expectations (via init_conntrack()) only if a conntrack entry
      can not be found.  The expectation for a packet which also matches an
      existing conntrack entry will not be removed by conntrack, and is
      currently handled inconsistently by OVS, as OVS expects the
      expectation to be removed when the connection tracking entry matching
      that expectation is confirmed.
      
      It should be noted that normally an IP stack would not allow reuse of
      a 5-tuple of an old (possibly lingering) connection for a new data
      connection, so this is somewhat unlikely corner case.  However, it is
      possible that a misbehaving source could cause conntrack entries be
      created that could then interfere with new related connections.
      
      Fix this in the OVS module by deleting the clashing conntrack entry
      after an expectation has been matched.  This causes the following
      nf_conntrack_in() call also find the expectation and remove it when
      creating the new conntrack entry, as well as the forthcoming reply
      direction packets to match the new related connection instead of the
      old clashing conntrack entry.
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Reported-by: NYang Song <yangsong@vmware.com>
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      cf5d7091
    • J
      openvswitch: Add eventmask support to CT action. · 12064551
      Jarno Rajahalme 提交于
      Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
      which can be used in conjunction with the commit flag
      (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
      conntrack events (IPCT_*) should be delivered via the Netfilter
      netlink multicast groups.  Default behavior depends on the system
      configuration, but typically a lot of events are delivered.  This can be
      very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
      types of events are of interest.
      
      Netfilter core init_conntrack() adds the event cache extension, so we
      only need to set the ctmask value.  However, if the system is
      configured without support for events, the setting will be skipped due
      to extension not being found.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Reviewed-by: NGreg Rose <gvrose8192@gmail.com>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12064551
    • J
      openvswitch: Typo fix. · abd0a4f2
      Jarno Rajahalme 提交于
      Fix typo in a comment.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NGreg Rose <gvrose8192@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      abd0a4f2
  18. 15 4月, 2017 1 次提交
  19. 14 4月, 2017 1 次提交
  20. 02 4月, 2017 1 次提交
  21. 29 3月, 2017 1 次提交
    • J
      openvswitch: Fix refcount leak on force commit. · b768b16d
      Jarno Rajahalme 提交于
      The reference count held for skb needs to be released when the skb's
      nfct pointer is cleared regardless of if nf_ct_delete() is called or
      not.
      
      Failing to release the skb's reference cound led to deferred conntrack
      cleanup spinning forever within nf_conntrack_cleanup_net_list() when
      cleaning up a network namespace:
      
         kworker/u16:0-19025 [004] 45981067.173642: sched_switch: kworker/u16:0:19025 [120] R ==> rcu_preempt:7 [120]
         kworker/u16:0-19025 [004] 45981067.173651: kernel_stack: <stack trace>
      => ___preempt_schedule (ffffffffa001ed36)
      => _raw_spin_unlock_bh (ffffffffa0713290)
      => nf_ct_iterate_cleanup (ffffffffc00a4454)
      => nf_conntrack_cleanup_net_list (ffffffffc00a5e1e)
      => nf_conntrack_pernet_exit (ffffffffc00a63dd)
      => ops_exit_list.isra.1 (ffffffffa06075f3)
      => cleanup_net (ffffffffa0607df0)
      => process_one_work (ffffffffa0084c31)
      => worker_thread (ffffffffa008592b)
      => kthread (ffffffffa008bee2)
      => ret_from_fork (ffffffffa071b67c)
      
      Fixes: dd41d33f ("openvswitch: Add force commit.")
      Reported-by: NYang Song <yangsong@vmware.com>
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b768b16d
  22. 23 3月, 2017 4 次提交
    • A
      Openvswitch: Refactor sample and recirc actions implementation · bef7f756
      andy zhou 提交于
      Added clone_execute() that both the sample and the recirc
      action implementation can use.
      Signed-off-by: NAndy Zhou <azhou@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bef7f756
    • A
      openvswitch: Optimize sample action for the clone use cases · 798c1661
      andy zhou 提交于
      With the introduction of open flow 'clone' action, the OVS user space
      can now translate the 'clone' action into kernel datapath 'sample'
      action, with 100% probability, to ensure that the clone semantics,
      which is that the packet seen by the clone action is the same as the
      packet seen by the action after clone, is faithfully carried out
      in the datapath.
      
      While the sample action in the datpath has the matching semantics,
      its implementation is only optimized for its original use.
      Specifically, there are two limitation: First, there is a 3 level of
      nesting restriction, enforced at the flow downloading time. This
      limit turns out to be too restrictive for the 'clone' use case.
      Second, the implementation avoid recursive call only if the sample
      action list has a single userspace action.
      
      The main optimization implemented in this series removes the static
      nesting limit check, instead, implement the run time recursion limit
      check, and recursion avoidance similar to that of the 'recirc' action.
      This optimization solve both #1 and #2 issues above.
      
      One related optimization attempts to avoid copying flow key as
      long as the actions enclosed does not change the flow key. The
      detection is performed only once at the flow downloading time.
      
      Another related optimization is to rewrite the action list
      at flow downloading time in order to save the fast path from parsing
      the sample action list in its original form repeatedly.
      Signed-off-by: NAndy Zhou <azhou@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      798c1661
    • A
      openvswitch: Refactor recirc key allocation. · 4572ef52
      andy zhou 提交于
      The logic of allocating and copy key for each 'exec_actions_level'
      was specific to execute_recirc(). However, future patches will reuse
      as well.  Refactor the logic into its own function clone_key().
      Signed-off-by: NAndy Zhou <azhou@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4572ef52
    • A
      openvswitch: Deferred fifo API change. · 47c697aa
      andy zhou 提交于
      add_deferred_actions() API currently requires actions to be passed in
      as a fully encoded netlink message. So far both 'sample' and 'recirc'
      actions happens to carry actions as fully encoded netlink messages.
      However, this requirement is more restrictive than necessary, future
      patch will need to pass in action lists that are not fully encoded
      by themselves.
      Signed-off-by: NAndy Zhou <azhou@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47c697aa
  23. 17 3月, 2017 1 次提交
  24. 16 3月, 2017 1 次提交
  25. 03 3月, 2017 1 次提交
  26. 02 3月, 2017 1 次提交
    • E
      ipv6: orphan skbs in reassembly unit · 48cac18e
      Eric Dumazet 提交于
      Andrey reported a use-after-free in IPv6 stack.
      
      Issue here is that we free the socket while it still has skb
      in TX path and in some queues.
      
      It happens here because IPv6 reassembly unit messes skb->truesize,
      breaking skb_set_owner_w() badly.
      
      We fixed a similar issue for IPV4 in commit 8282f274 ("inet: frag:
      Always orphan skbs inside ip_defrag()")
      Acked-by: NJoe Stringer <joe@ovn.org>
      
      ==================================================================
      BUG: KASAN: use-after-free in sock_wfree+0x118/0x120
      Read of size 8 at addr ffff880062da0060 by task a.out/4140
      
      page:ffffea00018b6800 count:1 mapcount:0 mapping:          (null)
      index:0x0 compound_mapcount: 0
      flags: 0x100000000008100(slab|head)
      raw: 0100000000008100 0000000000000000 0000000000000000 0000000180130013
      raw: dead000000000100 dead000000000200 ffff88006741f140 0000000000000000
      page dumped because: kasan: bad access detected
      
      CPU: 0 PID: 4140 Comm: a.out Not tainted 4.10.0-rc3+ #59
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:15
       dump_stack+0x292/0x398 lib/dump_stack.c:51
       describe_address mm/kasan/report.c:262
       kasan_report_error+0x121/0x560 mm/kasan/report.c:370
       kasan_report mm/kasan/report.c:392
       __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:413
       sock_flag ./arch/x86/include/asm/bitops.h:324
       sock_wfree+0x118/0x120 net/core/sock.c:1631
       skb_release_head_state+0xfc/0x250 net/core/skbuff.c:655
       skb_release_all+0x15/0x60 net/core/skbuff.c:668
       __kfree_skb+0x15/0x20 net/core/skbuff.c:684
       kfree_skb+0x16e/0x4e0 net/core/skbuff.c:705
       inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
       inet_frag_put ./include/net/inet_frag.h:133
       nf_ct_frag6_gather+0x1125/0x38b0 net/ipv6/netfilter/nf_conntrack_reasm.c:617
       ipv6_defrag+0x21b/0x350 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
       nf_hook_entry_hookfn ./include/linux/netfilter.h:102
       nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
       nf_hook ./include/linux/netfilter.h:212
       __ip6_local_out+0x52c/0xaf0 net/ipv6/output_core.c:160
       ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
       ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722
       ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742
       rawv6_push_pending_frames net/ipv6/raw.c:613
       rawv6_sendmsg+0x2cff/0x4130 net/ipv6/raw.c:927
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
       sock_sendmsg_nosec net/socket.c:635
       sock_sendmsg+0xca/0x110 net/socket.c:645
       sock_write_iter+0x326/0x620 net/socket.c:848
       new_sync_write fs/read_write.c:499
       __vfs_write+0x483/0x760 fs/read_write.c:512
       vfs_write+0x187/0x530 fs/read_write.c:560
       SYSC_write fs/read_write.c:607
       SyS_write+0xfb/0x230 fs/read_write.c:599
       entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203
      RIP: 0033:0x7ff26e6f5b79
      RSP: 002b:00007ff268e0ed98 EFLAGS: 00000206 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007ff268e0f9c0 RCX: 00007ff26e6f5b79
      RDX: 0000000000000010 RSI: 0000000020f50fe1 RDI: 0000000000000003
      RBP: 00007ff26ebc1220 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
      R13: 00007ff268e0f9c0 R14: 00007ff26efec040 R15: 0000000000000003
      
      The buggy address belongs to the object at ffff880062da0000
       which belongs to the cache RAWv6 of size 1504
      The buggy address ffff880062da0060 is located 96 bytes inside
       of 1504-byte region [ffff880062da0000, ffff880062da05e0)
      
      Freed by task 4113:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
       save_stack+0x43/0xd0 mm/kasan/kasan.c:502
       set_track mm/kasan/kasan.c:514
       kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
       slab_free_hook mm/slub.c:1352
       slab_free_freelist_hook mm/slub.c:1374
       slab_free mm/slub.c:2951
       kmem_cache_free+0xb2/0x2c0 mm/slub.c:2973
       sk_prot_free net/core/sock.c:1377
       __sk_destruct+0x49c/0x6e0 net/core/sock.c:1452
       sk_destruct+0x47/0x80 net/core/sock.c:1460
       __sk_free+0x57/0x230 net/core/sock.c:1468
       sk_free+0x23/0x30 net/core/sock.c:1479
       sock_put ./include/net/sock.h:1638
       sk_common_release+0x31e/0x4e0 net/core/sock.c:2782
       rawv6_close+0x54/0x80 net/ipv6/raw.c:1214
       inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425
       inet6_release+0x50/0x70 net/ipv6/af_inet6.c:431
       sock_release+0x8d/0x1e0 net/socket.c:599
       sock_close+0x16/0x20 net/socket.c:1063
       __fput+0x332/0x7f0 fs/file_table.c:208
       ____fput+0x15/0x20 fs/file_table.c:244
       task_work_run+0x19b/0x270 kernel/task_work.c:116
       exit_task_work ./include/linux/task_work.h:21
       do_exit+0x186b/0x2800 kernel/exit.c:839
       do_group_exit+0x149/0x420 kernel/exit.c:943
       SYSC_exit_group kernel/exit.c:954
       SyS_exit_group+0x1d/0x20 kernel/exit.c:952
       entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203
      
      Allocated by task 4115:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
       save_stack+0x43/0xd0 mm/kasan/kasan.c:502
       set_track mm/kasan/kasan.c:514
       kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
       slab_post_alloc_hook mm/slab.h:432
       slab_alloc_node mm/slub.c:2708
       slab_alloc mm/slub.c:2716
       kmem_cache_alloc+0x1af/0x250 mm/slub.c:2721
       sk_prot_alloc+0x65/0x2a0 net/core/sock.c:1334
       sk_alloc+0x105/0x1010 net/core/sock.c:1396
       inet6_create+0x44d/0x1150 net/ipv6/af_inet6.c:183
       __sock_create+0x4f6/0x880 net/socket.c:1199
       sock_create net/socket.c:1239
       SYSC_socket net/socket.c:1269
       SyS_socket+0xf9/0x230 net/socket.c:1249
       entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:203
      
      Memory state around the buggy address:
       ffff880062d9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff880062d9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff880062da0000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                             ^
       ffff880062da0080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff880062da0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ==================================================================
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48cac18e
  27. 20 2月, 2017 1 次提交
  28. 16 2月, 2017 1 次提交
  29. 10 2月, 2017 6 次提交
    • J
      openvswitch: Pack struct sw_flow_key. · 316d4d78
      Jarno Rajahalme 提交于
      struct sw_flow_key has two 16-bit holes. Move the most matched
      conntrack match fields there.  In some typical cases this reduces the
      size of the key that needs to be hashed into half and into one cache
      line.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      316d4d78
    • J
      openvswitch: Add force commit. · dd41d33f
      Jarno Rajahalme 提交于
      Stateful network admission policy may allow connections to one
      direction and reject connections initiated in the other direction.
      After policy change it is possible that for a new connection an
      overlapping conntrack entry already exists, where the original
      direction of the existing connection is opposed to the new
      connection's initial packet.
      
      Most importantly, conntrack state relating to the current packet gets
      the "reply" designation based on whether the original direction tuple
      or the reply direction tuple matched.  If this "directionality" is
      wrong w.r.t. to the stateful network admission policy it may happen
      that packets in neither direction are correctly admitted.
      
      This patch adds a new "force commit" option to the OVS conntrack
      action that checks the original direction of an existing conntrack
      entry.  If that direction is opposed to the current packet, the
      existing conntrack entry is deleted and a new one is subsequently
      created in the correct direction.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd41d33f
    • J
      openvswitch: Add original direction conntrack tuple to sw_flow_key. · 9dd7f890
      Jarno Rajahalme 提交于
      Add the fields of the conntrack original direction 5-tuple to struct
      sw_flow_key.  The new fields are initially marked as non-existent, and
      are populated whenever a conntrack action is executed and either finds
      or generates a conntrack entry.  This means that these fields exist
      for all packets that were not rejected by conntrack as untrackable.
      
      The original tuple fields in the sw_flow_key are filled from the
      original direction tuple of the conntrack entry relating to the
      current packet, or from the original direction tuple of the master
      conntrack entry, if the current conntrack entry has a master.
      Generally, expected connections of connections having an assigned
      helper (e.g., FTP), have a master conntrack entry.
      
      The main purpose of the new conntrack original tuple fields is to
      allow matching on them for policy decision purposes, with the premise
      that the admissibility of tracked connections reply packets (as well
      as original direction packets), and both direction packets of any
      related connections may be based on ACL rules applying to the master
      connection's original direction 5-tuple.  This also makes it easier to
      make policy decisions when the actual packet headers might have been
      transformed by NAT, as the original direction 5-tuple represents the
      packet headers before any such transformation.
      
      When using the original direction 5-tuple the admissibility of return
      and/or related packets need not be based on the mere existence of a
      conntrack entry, allowing separation of admission policy from the
      established conntrack state.  While existence of a conntrack entry is
      required for admission of the return or related packets, policy
      changes can render connections that were initially admitted to be
      rejected or dropped afterwards.  If the admission of the return and
      related packets was based on mere conntrack state (e.g., connection
      being in an established state), a policy change that would make the
      connection rejected or dropped would need to find and delete all
      conntrack entries affected by such a change.  When using the original
      direction 5-tuple matching the affected conntrack entries can be
      allowed to time out instead, as the established state of the
      connection would not need to be the basis for packet admission any
      more.
      
      It should be noted that the directionality of related connections may
      be the same or different than that of the master connection, and
      neither the original direction 5-tuple nor the conntrack state bits
      carry this information.  If needed, the directionality of the master
      connection can be stored in master's conntrack mark or labels, which
      are automatically inherited by the expected related connections.
      
      The fact that neither ARP nor ND packets are trackable by conntrack
      allows mutual exclusion between ARP/ND and the new conntrack original
      tuple fields.  Hence, the IP addresses are overlaid in union with ARP
      and ND fields.  This allows the sw_flow_key to not grow much due to
      this patch, but it also means that we must be careful to never use the
      new key fields with ARP or ND packets.  ARP is easy to distinguish and
      keep mutually exclusive based on the ethernet type, but ND being an
      ICMPv6 protocol requires a bit more attention.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9dd7f890
    • J
      openvswitch: Inherit master's labels. · 09aa98ad
      Jarno Rajahalme 提交于
      We avoid calling into nf_conntrack_in() for expected connections, as
      that would remove the expectation that we want to stick around until
      we are ready to commit the connection.  Instead, we do a lookup in the
      expectation table directly.  However, after a successful expectation
      lookup we have set the flow key label field from the master
      connection, whereas nf_conntrack_in() does not do this.  This leads to
      master's labels being inherited after an expectation lookup, but those
      labels not being inherited after the corresponding conntrack action
      with a commit flag.
      
      This patch resolves the problem by changing the commit code path to
      also inherit the master's labels to the expected connection.
      Resolving this conflict in favor of inheriting the labels allows more
      information be passed from the master connection to related
      connections, which would otherwise be much harder if the 32 bits in
      the connmark are not enough.  Labels can still be set explicitly, so
      this change only affects the default values of the labels in presense
      of a master connection.
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      09aa98ad
    • J
      openvswitch: Refactor labels initialization. · 6ffcea79
      Jarno Rajahalme 提交于
      Refactoring conntrack labels initialization makes changes in later
      patches easier to review.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ffcea79
    • J
      openvswitch: Simplify labels length logic. · b87cec38
      Jarno Rajahalme 提交于
      Since 23014011 ("netfilter: conntrack: support a fixed size of 128
      distinct labels"), the size of conntrack labels extension has fixed to
      128 bits, so we do not need to check for labels sizes shorter than 128
      at run-time.  This patch simplifies labels length logic accordingly,
      but allows the conntrack labels size to be increased in the future
      without breaking the build.  In the event of conntrack labels
      increasing in size OVS would still be able to deal with the 128 first
      label bits.
      Suggested-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b87cec38