1. 11 Oct, 2017 2 commits
  2. 22 Aug, 2017 1 commit
    • D
      net: check type when freeing metadata dst · e65a4955
      David Lamparter committed
      Commit 3fcece12 ("net: store port/representator id in metadata_dst")
      added a new type field to metadata_dst, but metadata_dst_free() wasn't
      updated to check it before freeing the METADATA_IP_TUNNEL specific dst
      cache entry.
      
      This is not currently causing problems since it's far enough back in the
      struct to be zeroed for the only other type currently in existence
      (METADATA_HW_PORT_MUX), but nevertheless it's not correct.
      
      Fixes: 3fcece12 ("net: store port/representator id in metadata_dst")
      Signed-off-by: David Lamparter <equinox@diac24.net>
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e65a4955
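
The fix described above amounts to checking a type tag before touching a union member. A minimal userspace sketch, assuming hypothetical stand-in types (`tunnel_info_sketch`, `port_info_sketch`) and helper names that are not the kernel's actual definitions:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; the enum names follow the commit
 * message, everything else is illustrative only. */
enum metadata_type {
    METADATA_IP_TUNNEL,
    METADATA_HW_PORT_MUX,
};

struct tunnel_info_sketch { void *dst_cache; };      /* owns a heap buffer */
struct port_info_sketch   { unsigned int port_id; }; /* owns nothing */

struct metadata_dst_sketch {
    enum metadata_type type;
    union {
        struct tunnel_info_sketch tun_info;
        struct port_info_sketch   port_info;
    } u;
};

static int cache_frees; /* counts cache releases, for demonstration only */

static struct metadata_dst_sketch *metadata_dst_alloc_sketch(enum metadata_type type)
{
    struct metadata_dst_sketch *md = calloc(1, sizeof(*md));

    if (!md)
        return NULL;
    md->type = type;
    if (type == METADATA_IP_TUNNEL)
        md->u.tun_info.dst_cache = malloc(64);
    return md;
}

static void metadata_dst_free_sketch(struct metadata_dst_sketch *md)
{
    /* Only the IP tunnel variant owns a cache, so check the tag first.
     * Without this check, a port id stored in the union would be
     * misread as a cache pointer. */
    if (md->type == METADATA_IP_TUNNEL) {
        free(md->u.tun_info.dst_cache);
        cache_frees++;
    }
    free(md);
}
```

Freeing a `METADATA_HW_PORT_MUX` entry whose `port_id` is non-zero is the case the unchecked version would get wrong, since the port id overlaps the cache pointer in the union.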
  3. 19 Aug, 2017 1 commit
  4. 25 Jun, 2017 1 commit
    • J
      net: store port/representator id in metadata_dst · 3fcece12
      Jakub Kicinski committed
      Switches and modern SR-IOV enabled NICs may multiplex traffic from port
      representators and control messages over a single set of hardware queues.
      Control messages and muxed traffic may need ordered delivery.
      
      Those requirements make it hard to comfortably use TC infrastructure today
      unless we have a way of attaching metadata to skbs at the upper device.
      Because a single set of queues is used for many netdevs, reliably stopping
      the TC/sched queues of all of them is impossible, and the lower device has
      to fall back to returning NETDEV_TX_BUSY and usually has to take extra
      locks on the fastpath.
      
      This patch attempts to enable port/representative devs to attach metadata
      to skbs which carry port id.  This way representatives can be queueless and
      all queuing can be performed at the lower netdev in the usual way.
      
      Traffic arriving on the port/representative interfaces will have
      metadata attached and will subsequently be queued to the lower device for
      transmission.  The lower device should recognize the metadata and translate
      it to HW specific format which is most likely either a special header
      inserted before the network headers or descriptor/metadata fields.
      
      Metadata is associated with the lower device by storing the netdev pointer
      along with port id so that if TC decides to redirect or mirror the new
      netdev will not try to interpret it.
      
      This is mostly for SR-IOV devices since switches don't have lower netdevs
      today.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3fcece12
  5. 18 Jun, 2017 6 commits
    • W
      net: remove DST_NOCACHE flag · a4c2fd7f
      Wei Wang committed
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and removes this flag
      completely.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4c2fd7f
    • W
      net: remove DST_NOGC flag · b2a9c0ed
      Wei Wang committed
      Now that all the components have been changed to release dst based on
      refcnt only and not depend on dst gc anymore, we can remove the
      temporary flag DST_NOGC.
      
      Note that we also need to remove the DST_NOCACHE check in dst_release()
      and dst_hold_safe() because now all the dst are released based on refcnt
      and behave as DST_NOCACHE.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2a9c0ed
    • W
      net: remove dst gc related code · 5b7c9a8f
      Wei Wang committed
      This patch removes all dst gc related code and all the dst free
      functions.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b7c9a8f
    • W
      xfrm: take refcnt of dst when creating struct xfrm_dst bundle · 52df157f
      Wei Wang committed
      During the creation of xfrm_dst bundle, always take ref count when
      allocating the dst. This way, xfrm_bundle_create() will form a linked
      list of dst with dst->child pointing to a ref counted dst child. And
      the returned dst pointer is also ref counted. This makes the link from
      the flow cache to this dst now ref counted properly.
      As the dst is always ref counted properly, we can safely mark
      DST_NOGC flag so dst_release() will release dst based on refcnt only.
      And dst gc is no longer needed and all dst_free() and its related
      function calls should be replaced with dst_release() or
      dst_release_immediate().
      
      The special handling logic for dst->child in dst_destroy() can be
      replaced with a simple dst_release_immediate() call on the child to
      release the whole list linked by dst->child pointer.
      Previously used DST_NOHASH flag is not needed anymore as well. The
      reason that DST_NOHASH is used in the existing code is mainly to prevent
      the dst inserted in the fib tree to be wrongly destroyed during the
      deletion of the xfrm_dst bundle. So in the existing code, DST_NOHASH
      flag is marked in all the dst children except the one which is in the
      fib tree.
      However, with this patch series to remove dst gc logic and release dst
      only based on ref count, it is safe to release all the children from a
      xfrm_dst bundle as long as the dst children are all ref counted
      properly which is already the case in the existing code.
      So, this patch removes the use of DST_NOHASH flag.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52df157f
    • W
      net: introduce a new function dst_dev_put() · 4a6ce2b6
      Wei Wang committed
      This function should be called when removing routes from fib tree after
      the dst gc is no longer in use.
      We first mark DST_OBSOLETE_DEAD on this dst to make sure next
      dst_ops->check() fails and returns NULL.
      Secondly, as we no longer keep the gc_list, we need to properly
      release dst->dev right at the moment when the dst is removed from
      the fib/fib6 tree.
      It does the following:
      1. change dst->input and output pointers to dst_discard/dst_discard_out to
         discard all packets
      2. replace dst->dev with loopback interface
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a6ce2b6
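
The steps listed for dst_dev_put() can be sketched in userspace as follows. This is a single-threaded illustration under stated assumptions: the struct layouts, the `OBSOLETE_DEAD_SKETCH` constant, and the plain integer reference counts are hypothetical and stand in for the kernel's real types and its dev_hold()/dev_put() helpers.

```c
#include <assert.h>

#define OBSOLETE_DEAD_SKETCH 2 /* hypothetical marker for a dead entry */

struct netdev_sketch {
    const char *name;
    int refcnt; /* plain int: this sketch is single-threaded */
};

struct route_dst_sketch {
    int obsolete;
    int (*input)(void);
    int (*output)(void);
    struct netdev_sketch *dev;
};

/* Stand-in for a normal forwarding handler. */
static int forward_packet_sketch(void) { return 0; }

/* Stand-in for the discard handlers: every packet is dropped. */
static int discard_packet_sketch(void) { return -1; }

static void dst_dev_put_sketch(struct route_dst_sketch *dst,
                               struct netdev_sketch *loopback)
{
    /* Mark the entry dead so the next validity check fails. */
    dst->obsolete = OBSOLETE_DEAD_SKETCH;

    /* Step 1: discard all packets from now on. */
    dst->input = discard_packet_sketch;
    dst->output = discard_packet_sketch;

    /* Step 2: swap the device for loopback, moving the reference. */
    loopback->refcnt++;
    dst->dev->refcnt--;
    dst->dev = loopback;
}
```

After the call the original device no longer has a reference held by this entry, which is the point of doing the swap at removal time rather than waiting for a garbage collector.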
    • W
      net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt · 5f56f409
      Wei Wang committed
      The current mechanism of freeing dst is a bit complicated. dst has its
      ref count and when user grabs the reference to the dst, the ref count is
      properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
      code due to some historic reasons.
      
      If the reference to dst is always taken properly, we should be able to
      simplify the logic in dst_release() to destroy dst when dst->__refcnt
      drops from 1 to 0. And this should be the only condition to determine
      if we can call dst_destroy().
      And as dst is always ref counted, there is no need for a dst garbage
      list to hold the dst entries that already get removed by the routing
      code but are still held by other users. And the task to periodically
      check the list to free dst if ref count becomes 0 is also not needed
      anymore.
      
      This patch introduces a temporary flag DST_NOGC(no garbage collector).
      If it is set in the dst, dst_release() will call dst_destroy() when
      dst->__refcnt drops to 0. dst_hold_safe() will also check for this flag
      and do atomic_inc_not_zero() similar to DST_NOCACHE to avoid double free
      issue.
      This temporary flag is mainly used so that we can make the transition
      component by component without breaking other parts.
      This flag will be removed after all components are properly transitioned.
      
      This patch also introduces a new function dst_release_immediate() which
      destroys dst without waiting on the rcu when refcnt drops to 0. It will
      be used in later patches.
      
      Follow-up patches will correct all the places to properly take ref count
      on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
      be used to release the dst instead of dst_free() and its related
      functions.
      And final clean-up patch will remove the DST_NOGC flag.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f56f409
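
The rule this series converges on, destroying a dst exactly when its reference count drops from 1 to 0, can be sketched with C11 atomics. The names below are hypothetical, and the sketch destroys the object immediately, which is closer to what the message describes for dst_release_immediate(); deferring destruction past an RCU grace period is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct dst_sketch {
    atomic_int refcnt;
    bool *destroyed; /* demonstration hook: set when the object is freed */
};

static struct dst_sketch *dst_alloc_sketch(bool *destroyed)
{
    struct dst_sketch *dst = malloc(sizeof(*dst));

    if (!dst)
        return NULL;
    atomic_init(&dst->refcnt, 1); /* the creator holds the first reference */
    dst->destroyed = destroyed;
    return dst;
}

static void dst_hold_sketch(struct dst_sketch *dst)
{
    atomic_fetch_add(&dst->refcnt, 1);
}

static void dst_release_sketch(struct dst_sketch *dst)
{
    if (!dst)
        return;
    /* atomic_fetch_sub() returns the value before the decrement, so a
     * result of 1 means this call dropped the last reference. */
    if (atomic_fetch_sub(&dst->refcnt, 1) == 1) {
        if (dst->destroyed)
            *dst->destroyed = true;
        free(dst);
    }
}
```

With this rule in place no separate garbage list is needed: whoever drops the last reference frees the entry.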
  6. 10 Jun, 2017 1 commit
    • K
      Fix an intermittent pr_emerg warning about lo becoming free. · f186ce61
      Krister Johansen committed
      It looks like this:
      
      Message from syslogd@flamingo at Apr 26 00:45:00 ...
       kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
      
      They seem to coincide with net namespace teardown.
      
      The message is emitted by netdev_wait_allrefs().
      
      Forced a kdump in netdev_run_todo, but found that the refcount on the lo
      device was already 0 at the time we got to the panic.
      
      Used bcc to check the blocking in netdev_run_todo.  The only places
      where we're off cpu there are in the rcu_barrier() and msleep() calls.
      That behavior is expected.  The msleep time coincides with the amount of
      time we spend waiting for the refcount to reach zero; the rcu_barrier()
      wait times are not excessive.
      
      After looking through the list of callbacks that the netdevice notifiers
      invoke in this path, it appears that the dst_dev_event is the most
      interesting.  The dst_ifdown path places a hold on the loopback_dev as
      part of releasing the dev associated with the original dst cache entry.
      Most of our notifier callbacks are straight-forward, but this one a)
      looks complex, and b) places a hold on the network interface in
      question.
      
      I constructed a new bcc script that watches various events in the
      lifetime of a dst cache entry.  Note that dst_ifdown will take a hold on
      the loopback device until the invalidated dst entry gets freed.
      
      [      __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
          __dst_free
          rcu_nocb_kthread
          kthread
          ret_from_fork
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f186ce61
  7. 27 May, 2017 1 commit
    • E
      ipv4: add reference counting to metrics · 3fb07daf
      Eric Dumazet committed
      Andrey Konovalov reported crashes in ipv4_mtu()
      
      I could reproduce the issue with KASAN kernels, between
      10.246.7.151 and 10.246.7.152 :
      
      1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
      
      2) At the same time run following loop :
      while :
      do
       ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
       ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
      done
      
      Cong Wang attempted to add back rt->fi in commit
      82486aa6 ("ipv4: restore rt->fi for reference counting")
      but this proved to add some issues that were complex to solve.
      
      Instead, I suggested adding a refcount to the metrics themselves,
      being a standalone object (in particular, no reference to other objects)
      
      I tried to make this patch as small as possible to ease its backport,
      instead of being super clean. Note that we believe that only ipv4 dst
      need to take care of the metric refcount. But if this is wrong,
      this patch adds the basic infrastructure to extend this to other
      families.
      
      Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
      for his efforts on this problem.
      
      Fixes: 2860583f ("ipv4: Kill rt->fi")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3fb07daf
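
The idea of giving the metrics their own reference count, so that a cached dst can keep them alive after the route that created them is deleted, might look roughly like this in userspace. All names here (`metrics_sketch`, `metrics_get_sketch`, and so on) are hypothetical and do not reflect the kernel's actual structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* A standalone, refcounted metrics object with no pointers to others. */
struct metrics_sketch {
    atomic_int refcnt;
    unsigned int mtu;
};

static int metrics_freed; /* counts frees, for demonstration only */

static struct metrics_sketch *metrics_alloc_sketch(unsigned int mtu)
{
    struct metrics_sketch *m = malloc(sizeof(*m));

    if (!m)
        return NULL;
    atomic_init(&m->refcnt, 1); /* reference held by the route entry */
    m->mtu = mtu;
    return m;
}

/* Called when a dst starts sharing the route's metrics. */
static struct metrics_sketch *metrics_get_sketch(struct metrics_sketch *m)
{
    atomic_fetch_add(&m->refcnt, 1);
    return m;
}

/* Called by both the route entry and the dst when they are torn down. */
static void metrics_put_sketch(struct metrics_sketch *m)
{
    if (atomic_fetch_sub(&m->refcnt, 1) == 1) {
        free(m);
        metrics_freed++;
    }
}
```

In the reproduction scenario above, the `ip ro del` step corresponds to the route dropping its reference while a dst used by a TCP flow still holds one, so a later MTU lookup reads valid memory.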
  8. 08 Feb, 2017 1 commit
  9. 17 Feb, 2016 1 commit
  10. 07 Jan, 2016 1 commit
  11. 10 Nov, 2015 1 commit
  12. 08 Oct, 2015 1 commit
  13. 01 Sep, 2015 1 commit
  14. 26 Aug, 2015 1 commit
    • W
      route: fix a use-after-free · e252b3d1
      WANG Cong committed
      This patch fixes the following crash:
      
       general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc7+ #166
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff88010656d280 ti: ffff880106570000 task.ti: ffff880106570000
       RIP: 0010:[<ffffffff8182f91b>]  [<ffffffff8182f91b>] dst_destroy+0xa6/0xef
       RSP: 0018:ffff880107603e38  EFLAGS: 00010202
       RAX: 0000000000000001 RBX: ffff8800d225a000 RCX: ffffffff82250fd0
       RDX: 0000000000000001 RSI: ffffffff82250fd0 RDI: 6b6b6b6b6b6b6b6b
       RBP: ffff880107603e58 R08: 0000000000000001 R09: 0000000000000001
       R10: 000000000000b530 R11: ffff880107609000 R12: 0000000000000000
       R13: ffffffff82343c40 R14: 0000000000000000 R15: ffffffff8182fb4f
       FS:  0000000000000000(0000) GS:ffff880107600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 00007fcabd9d3000 CR3: 00000000d7279000 CR4: 00000000000006e0
       Stack:
        ffffffff82250fd0 ffff8801077d6f00 ffffffff82253c40 ffff8800d225a000
        ffff880107603e68 ffffffff8182fb5d ffff880107603f08 ffffffff810d795e
        ffffffff810d7648 ffff880106574000 ffff88010656d280 ffff88010656d280
       Call Trace:
        <IRQ>
        [<ffffffff8182fb5d>] dst_destroy_rcu+0xe/0x1d
        [<ffffffff810d795e>] rcu_process_callbacks+0x618/0x7eb
        [<ffffffff810d7648>] ? rcu_process_callbacks+0x302/0x7eb
        [<ffffffff8182fb4f>] ? dst_gc_task+0x1eb/0x1eb
        [<ffffffff8107e11b>] __do_softirq+0x178/0x39f
        [<ffffffff8107e52e>] irq_exit+0x41/0x95
        [<ffffffff81a4f215>] smp_apic_timer_interrupt+0x34/0x40
        [<ffffffff81a4d5cd>] apic_timer_interrupt+0x6d/0x80
        <EOI>
        [<ffffffff8100b968>] ? default_idle+0x21/0x32
        [<ffffffff8100b966>] ? default_idle+0x1f/0x32
        [<ffffffff8100bf19>] arch_cpu_idle+0xf/0x11
        [<ffffffff810b0bc7>] default_idle_call+0x1f/0x21
        [<ffffffff810b0dce>] cpu_startup_entry+0x1ad/0x273
        [<ffffffff8102fe67>] start_secondary+0x135/0x156
      
      dst is freed right before lwtstate_put(), this is not correct...
      
      Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry")
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e252b3d1
  15. 21 Aug, 2015 1 commit
  16. 01 Aug, 2015 1 commit
    • A
      bpf: add helpers to access tunnel metadata · d3aa45ce
      Alexei Starovoitov committed
      Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
      bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
      skb: pointer to skb
      key: pointer to 'struct bpf_tunnel_key'
      size: size of 'struct bpf_tunnel_key'
      flags: room for future extensions
      
      First eBPF program that uses these helpers will allocate per_cpu
      metadata_dst structures that will be used on TX.
      On RX metadata_dst is allocated by tunnel driver.
      
      Typical usage for TX:
      struct bpf_tunnel_key tkey;
      ... populate tkey ...
      bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
      
      RX:
      struct bpf_tunnel_key tkey = {};
      bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      ... lookup or redirect based on tkey ...
      
      'struct bpf_tunnel_key' will be extended in the future by adding
      elements to the end and the 'size' argument will indicate which fields
      are populated, thereby keeping backwards compatibility.
      The 'flags' argument may be used as well when the 'size' is not enough or
      to indicate completely different layout of bpf_tunnel_key.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3aa45ce
  17. 22 Jul, 2015 1 commit
    • T
      dst: Metadata destinations · f38a9eb1
      Thomas Graf committed
      Introduces a new dst_metadata which makes it possible to carry per-packet metadata
      between forwarding and processing elements via the skb->dst pointer.
      
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      
      The metadata dst entry is refcnt'ed as expected for now but a non
      reference counted use is possible if the reference is forced before
      queueing the skb.
      
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function thus allowing to interpret
      metadata in a later commit.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f38a9eb1
  18. 21 Jul, 2015 1 commit
  19. 10 Dec, 2014 1 commit
  20. 26 Jun, 2014 1 commit
    • E
      ipv4: fix dst race in sk_dst_get() · f8864972
      Eric Dumazet committed
      When IP route cache had been removed in linux-3.6, we broke assumption
      that dst entries were all freed after rcu grace period. DST_NOCACHE
      dst were supposed to be freed from dst_release(). But it appears
      we want to keep such dst around, either in UDP sockets or tunnels.
      
      In sk_dst_get() we need to make sure dst refcount is not 0
      before incrementing it, or else we might end up freeing a dst
      twice.
      
      DST_NOCACHE set on a dst does not mean this dst can not be attached
      to a socket or a tunnel.
      
      Then, before actual freeing, we need to observe a rcu grace period
      to make sure all other cpus can catch the fact the dst is no longer
      usable.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dormando <dormando@rydia.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8864972
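
The "make sure the refcount is not 0 before incrementing it" requirement is the usual increment-unless-zero pattern. A minimal C11 sketch follows; the function name is hypothetical and this is a userspace analogue of the pattern, not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (refcount > 0).
 * Returns false when the count is already zero, in which case the
 * caller must treat the object as gone and not touch it. */
static bool refcount_inc_not_zero_sketch(atomic_int *refcnt)
{
    int old = atomic_load(refcnt);

    while (old != 0) {
        /* On failure the compare-exchange reloads 'old' with the
         * current value, so the loop re-checks for zero each time. */
        if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

A plain unconditional increment would revive an object whose last reference was just dropped, and the second release would then free it twice; the compare-exchange loop closes that window.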
  21. 16 Apr, 2014 1 commit
  22. 29 May, 2013 1 commit
  23. 02 Apr, 2013 1 commit
  24. 21 Feb, 2013 1 commit
  25. 23 Aug, 2012 1 commit
    • E
      net: remove delay at device dismantle · 0115e8e3
      Eric Dumazet committed
      I noticed extra one second delay in device dismantle, tracked down to
      a call to dst_dev_event() while some call_rcu() are still in RCU queues.
      
      These call_rcu() were posted by rt_free(struct rtable *rt) calls.
      
      We then wait a little (but one second) in netdev_wait_allrefs() before
      kicking again NETDEV_UNREGISTER.
      
      As the call_rcu() are now completed, dst_dev_event() can do the needed
      device swap on busy dst.
      
      To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
      after a rcu_barrier(), but outside of RTNL lock.
      
      Use NETDEV_UNREGISTER_FINAL with care !
      
      Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL
      
      Also remove NETDEV_UNREGISTER_BATCH, as it's not used anymore after
      IP cache removal.
      
      With help from Gao feng
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0115e8e3
  26. 14 Aug, 2012 1 commit
    • T
      workqueue: use mod_delayed_work() instead of cancel + queue · 41f63c53
      Tejun Heo committed
      Convert delayed_work users doing cancel_delayed_work() followed by
      queue_delayed_work() to mod_delayed_work().
      
      Most conversions are straight-forward.  Ones worth mentioning are,
      
      * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
        use mod_delayed_work() and cancel loop in
        edac_mc_reset_delay_period() is dropped.
      
      * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
        watchdog is active or not.  @fan_watchdog_active and related code
        dropped.
      
      * drivers/power/charger-manager.c: Seemingly a lot of
        delayed_work_pending() abuse going on here.
        [delayed_]work_pending() are unsynchronized and racy when used like
        this.  I converted one instance in fullbatt_handler().  Please
        convert the rest so that it invokes workqueue APIs for the intended
        target state rather than trying to game work item pending state
        transitions.  e.g. if timer should be modified - call
        mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
      
      * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
        simplified.  Note that round_jiffies() calls in this function are
        meaningless.  round_jiffies() work on absolute jiffies not delta
        delay used by delayed_work.
      
      v2: Tomi pointed out that __cancel_delayed_work() users can't be
          safely converted to mod_delayed_work().  They could be calling it
          from irq context and if that happens while delayed_work_timer_fn()
          is running, it could deadlock.  __cancel_delayed_work() users are
          dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      41f63c53
  27. 09 Aug, 2012 1 commit
    • E
      net: force dst_default_metrics to const section · a37e6e34
      Eric Dumazet committed
      While investigating on network performance problems, I found this little
      gem :
      
      $ nm -v vmlinux | grep -1 dst_default_metrics
      ffffffff82736540 b busy.46605
      ffffffff82736560 B dst_default_metrics
      ffffffff82736598 b dst_busy_list
      
      Apparently, declaring a const array without initializer put it in
      (writeable) bss section, in middle of possibly often dirtied cache
      lines.
      
      Since we really want dst_default_metrics be const to avoid any possible
      false sharing and catch any buggy writes, I force a null initializer.
      
      ffffffff818a4c20 R dst_default_metrics
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a37e6e34
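
The change described above comes down to giving a const array an explicit initializer. A minimal sketch, assuming a hypothetical array name and size; where the object actually lands depends on the toolchain, so the placement should be confirmed with `nm` as the message does.

```c
#include <assert.h>

#define METRICS_MAX_SKETCH 16 /* hypothetical size, for illustration only */

/* Without "= { 0 }" this file-scope declaration would be a tentative
 * definition in C, which a toolchain may place in a writable section.
 * The explicit initializer makes it a full definition of a const
 * object, which compilers commonly place in a read-only section. */
const unsigned int default_metrics_sketch[METRICS_MAX_SKETCH + 1] = { 0 };
```

The values are unchanged by the initializer (all zero either way); only the kind of definition, and therefore the likely section, differs.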
  28. 01 Aug, 2012 1 commit
    • E
      ipv4: Restore old dst_free() behavior. · 54764bb6
      Eric Dumazet committed
      commit 404e0a8b (net: ipv4: fix RCU races on dst refcounts) tried
      to solve a race but added a problem at device/fib dismantle time :
      
      We really want to call dst_free() as soon as possible, even if sockets
      still have dst in their cache.
      dst_release() calls in free_fib_info_rcu() are not welcomed.
      
      Root of the problem was that now we also cache output routes (in
      nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
      rt_free(), because output route lookups are done in process context.
      
      Based on feedback and initial patch from David Miller (adding another
      call_rcu_bh() call in fib, but it appears it was not the right fix)
      
      I left the inet_sk_rx_dst_set() helper and added __rcu attributes
      to nh_rth_output and nh_rth_input to better document what is going on in
      this code.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54764bb6
  29. 31 Jul, 2012 1 commit
    • E
      net: ipv4: fix RCU races on dst refcounts · 404e0a8b
      Eric Dumazet committed
      commit c6cffba4 (ipv4: Fix input route performance regression.)
      added various fatal races with dst refcounts.
      
      crashes happen on tcp workloads if routes are added/deleted at the same
      time.
      
      The dst_free() calls from free_fib_info_rcu() are clearly racy.
      
      We need instead regular dst refcounting (dst_release()) and make
      sure dst_release() is aware of RCU grace periods :
      
      Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
      before dst destruction for cached dst
      
      Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
      to make sure we don't increase a zero refcount (On a dst currently
      waiting an rcu grace period before destruction)
      
      rt_cache_route() must take a reference on the new cached route, and
      release it if it was not able to install it.
      
      With this patch, my machines survive various benchmarks.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      404e0a8b
  30. 21 Jul, 2012 1 commit
  31. 05 Jul, 2012 2 commits
  32. 06 Dec, 2011 1 commit
  33. 10 Aug, 2011 1 commit