1. 10 11月, 2015 1 次提交
  2. 08 10月, 2015 1 次提交
  3. 01 9月, 2015 1 次提交
  4. 26 8月, 2015 1 次提交
    • W
      route: fix a use-after-free · e252b3d1
      WANG Cong 提交于
      This patch fixes the following crash:
      
       general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc7+ #166
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff88010656d280 ti: ffff880106570000 task.ti: ffff880106570000
       RIP: 0010:[<ffffffff8182f91b>]  [<ffffffff8182f91b>] dst_destroy+0xa6/0xef
       RSP: 0018:ffff880107603e38  EFLAGS: 00010202
       RAX: 0000000000000001 RBX: ffff8800d225a000 RCX: ffffffff82250fd0
       RDX: 0000000000000001 RSI: ffffffff82250fd0 RDI: 6b6b6b6b6b6b6b6b
       RBP: ffff880107603e58 R08: 0000000000000001 R09: 0000000000000001
       R10: 000000000000b530 R11: ffff880107609000 R12: 0000000000000000
       R13: ffffffff82343c40 R14: 0000000000000000 R15: ffffffff8182fb4f
       FS:  0000000000000000(0000) GS:ffff880107600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 00007fcabd9d3000 CR3: 00000000d7279000 CR4: 00000000000006e0
       Stack:
        ffffffff82250fd0 ffff8801077d6f00 ffffffff82253c40 ffff8800d225a000
        ffff880107603e68 ffffffff8182fb5d ffff880107603f08 ffffffff810d795e
        ffffffff810d7648 ffff880106574000 ffff88010656d280 ffff88010656d280
       Call Trace:
        <IRQ>
        [<ffffffff8182fb5d>] dst_destroy_rcu+0xe/0x1d
        [<ffffffff810d795e>] rcu_process_callbacks+0x618/0x7eb
        [<ffffffff810d7648>] ? rcu_process_callbacks+0x302/0x7eb
        [<ffffffff8182fb4f>] ? dst_gc_task+0x1eb/0x1eb
        [<ffffffff8107e11b>] __do_softirq+0x178/0x39f
        [<ffffffff8107e52e>] irq_exit+0x41/0x95
        [<ffffffff81a4f215>] smp_apic_timer_interrupt+0x34/0x40
        [<ffffffff81a4d5cd>] apic_timer_interrupt+0x6d/0x80
        <EOI>
        [<ffffffff8100b968>] ? default_idle+0x21/0x32
        [<ffffffff8100b966>] ? default_idle+0x1f/0x32
        [<ffffffff8100bf19>] arch_cpu_idle+0xf/0x11
        [<ffffffff810b0bc7>] default_idle_call+0x1f/0x21
        [<ffffffff810b0dce>] cpu_startup_entry+0x1ad/0x273
        [<ffffffff8102fe67>] start_secondary+0x135/0x156
      
      dst is freed right before lwtstate_put(), this is not correct...
      
      Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry")
      Acked-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e252b3d1
  5. 21 8月, 2015 1 次提交
  6. 01 8月, 2015 1 次提交
    • A
      bpf: add helpers to access tunnel metadata · d3aa45ce
      Alexei Starovoitov 提交于
      Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
      bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
      skb: pointer to skb
      key: pointer to 'struct bpf_tunnel_key'
      size: size of 'struct bpf_tunnel_key'
      flags: room for future extensions
      
      First eBPF program that uses these helpers will allocate per_cpu
      metadata_dst structures that will be used on TX.
      On RX metadata_dst is allocated by tunnel driver.
      
      Typical usage for TX:
      struct bpf_tunnel_key tkey;
      ... populate tkey ...
      bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
      
      RX:
      struct bpf_tunnel_key tkey = {};
      bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
      ... lookup or redirect based on tkey ...
      
      'struct bpf_tunnel_key' will be extended in the future by adding
      elements to the end and the 'size' argument will indicate which fields
      are populated, thereby keeping backwards compatibility.
      The 'flags' argument may be used as well when the 'size' is not enough or
      to indicate completely different layout of bpf_tunnel_key.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3aa45ce
  7. 22 7月, 2015 1 次提交
    • T
      dst: Metadata destinations · f38a9eb1
      Thomas Graf 提交于
      Introduces a new dst_metadata which enables to carry per packet metadata
      between forwarding and processing elements via the skb->dst pointer.
      
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      
      The metadata dst entry is refcnt'ed as expected for now but a non
      reference counted use is possible if the reference is forced before
      queueing the skb.
      
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function thus allowing to interpret
      metadata in a later commit.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f38a9eb1
  8. 21 7月, 2015 1 次提交
  9. 10 12月, 2014 1 次提交
  10. 26 6月, 2014 1 次提交
    • E
      ipv4: fix dst race in sk_dst_get() · f8864972
      Eric Dumazet 提交于
      When IP route cache had been removed in linux-3.6, we broke assumption
      that dst entries were all freed after rcu grace period. DST_NOCACHE
      dst were supposed to be freed from dst_release(). But it appears
      we want to keep such dst around, either in UDP sockets or tunnels.
      
      In sk_dst_get() we need to make sure dst refcount is not 0
      before incrementing it, or else we might end up freeing a dst
      twice.
      
      DST_NOCACHE set on a dst does not mean this dst can not be attached
      to a socket or a tunnel.
      
      Then, before actual freeing, we need to observe a rcu grace period
      to make sure all other cpus can catch the fact the dst is no longer
      usable.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDormando <dormando@rydia.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8864972
  11. 16 4月, 2014 1 次提交
  12. 29 5月, 2013 1 次提交
  13. 02 4月, 2013 1 次提交
  14. 21 2月, 2013 1 次提交
  15. 23 8月, 2012 1 次提交
    • E
      net: remove delay at device dismantle · 0115e8e3
      Eric Dumazet 提交于
      I noticed extra one second delay in device dismantle, tracked down to
      a call to dst_dev_event() while some call_rcu() are still in RCU queues.
      
      These call_rcu() were posted by rt_free(struct rtable *rt) calls.
      
      We then wait a little (but one second) in netdev_wait_allrefs() before
      kicking again NETDEV_UNREGISTER.
      
      As the call_rcu() are now completed, dst_dev_event() can do the needed
      device swap on busy dst.
      
      To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
      after a rcu_barrier(), but outside of RTNL lock.
      
      Use NETDEV_UNREGISTER_FINAL with care !
      
      Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL
      
      Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
      IP cache removal.
      
      With help from Gao feng
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0115e8e3
  16. 14 8月, 2012 1 次提交
    • T
      workqueue: use mod_delayed_work() instead of cancel + queue · 41f63c53
      Tejun Heo 提交于
      Convert delayed_work users doing cancel_delayed_work() followed by
      queue_delayed_work() to mod_delayed_work().
      
      Most conversions are straight-forward.  Ones worth mentioning are,
      
      * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
        use mod_delayed_work() and cancel loop in
        edac_mc_reset_delay_period() is dropped.
      
      * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
        watchdog is active or not.  @fan_watchdog_active and related code
        dropped.
      
      * drivers/power/charger-manager.c: Seemingly a lot of
        delayed_work_pending() abuse going on here.
        [delayed_]work_pending() are unsynchronized and racy when used like
        this.  I converted one instance in fullbatt_handler().  Please
        conver the rest so that it invokes workqueue APIs for the intended
        target state rather than trying to game work item pending state
        transitions.  e.g. if timer should be modified - call
        mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
      
      * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
        simplified.  Note that round_jiffies() calls in this function are
        meaningless.  round_jiffies() work on absolute jiffies not delta
        delay used by delayed_work.
      
      v2: Tomi pointed out that __cancel_delayed_work() users can't be
          safely converted to mod_delayed_work().  They could be calling it
          from irq context and if that happens while delayed_work_timer_fn()
          is running, it could deadlock.  __cancel_delayed_work() users are
          dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NHenrique de Moraes Holschuh <hmh@hmh.eng.br>
      Acked-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Acked-by: NAnton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      41f63c53
  17. 09 8月, 2012 1 次提交
    • E
      net: force dst_default_metrics to const section · a37e6e34
      Eric Dumazet 提交于
      While investigating on network performance problems, I found this little
      gem :
      
      $ nm -v vmlinux | grep -1 dst_default_metrics
      ffffffff82736540 b busy.46605
      ffffffff82736560 B dst_default_metrics
      ffffffff82736598 b dst_busy_list
      
      Apparently, declaring a const array without initializer put it in
      (writeable) bss section, in middle of possibly often dirtied cache
      lines.
      
      Since we really want dst_default_metrics be const to avoid any possible
      false sharing and catch any buggy writes, I force a null initializer.
      
      ffffffff818a4c20 R dst_default_metrics
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a37e6e34
  18. 01 8月, 2012 1 次提交
    • E
      ipv4: Restore old dst_free() behavior. · 54764bb6
      Eric Dumazet 提交于
      commit 404e0a8b (net: ipv4: fix RCU races on dst refcounts) tried
      to solve a race but added a problem at device/fib dismantle time :
      
      We really want to call dst_free() as soon as possible, even if sockets
      still have dst in their cache.
      dst_release() calls in free_fib_info_rcu() are not welcomed.
      
      Root of the problem was that now we also cache output routes (in
      nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
      rt_free(), because output route lookups are done in process context.
      
      Based on feedback and initial patch from David Miller (adding another
      call_rcu_bh() call in fib, but it appears it was not the right fix)
      
      I left the inet_sk_rx_dst_set() helper and added __rcu attributes
      to nh_rth_output and nh_rth_input to better document what is going on in
      this code.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54764bb6
  19. 31 7月, 2012 1 次提交
    • E
      net: ipv4: fix RCU races on dst refcounts · 404e0a8b
      Eric Dumazet 提交于
      commit c6cffba4 (ipv4: Fix input route performance regression.)
      added various fatal races with dst refcounts.
      
      crashes happen on tcp workloads if routes are added/deleted at the same
      time.
      
      The dst_free() calls from free_fib_info_rcu() are clearly racy.
      
      We need instead regular dst refcounting (dst_release()) and make
      sure dst_release() is aware of RCU grace periods :
      
      Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
      before dst destruction for cached dst
      
      Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
      to make sure we dont increase a zero refcount (On a dst currently
      waiting an rcu grace period before destruction)
      
      rt_cache_route() must take a reference on the new cached route, and
      release it if was not able to install it.
      
      With this patch, my machines survive various benchmarks.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      404e0a8b
  20. 21 7月, 2012 1 次提交
  21. 05 7月, 2012 2 次提交
  22. 06 12月, 2011 1 次提交
  23. 10 8月, 2011 1 次提交
  24. 18 7月, 2011 1 次提交
  25. 14 7月, 2011 1 次提交
    • D
      net: Embed hh_cache inside of struct neighbour. · f6b72b62
      David S. Miller 提交于
      Now that there is a one-to-one correspondance between neighbour
      and hh_cache entries, we no longer need:
      
      1) dynamic allocation
      2) attachment to dst->hh
      3) refcounting
      
      Initialization of the hh_cache entry is indicated by hh_len
      being non-zero, and such initialization is always done with
      the neighbour's lock held as a writer.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6b72b62
  26. 02 7月, 2011 1 次提交
    • D
      ipv6: Don't put artificial limit on routing table size. · 957c665f
      David S. Miller 提交于
      IPV6, unlike IPV4, doesn't have a routing cache.
      
      Routing table entries, as well as clones made in response
      to route lookup requests, all live in the same table.  And
      all of these things are together collected in the destination
      cache table for ipv6.
      
      This means that routing table entries count against the garbage
      collection limits, even though such entries cannot ever be reclaimed
      and are added explicitly by the administrator (rather than being
      created in response to lookups).
      
      Therefore it makes no sense to count ipv6 routing table entries
      against the GC limits.
      
      Add a DST_NOCOUNT destination cache entry flag, and skip the counting
      if it is set.  Use this flag bit in ipv6 when adding routing table
      entries.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      957c665f
  27. 25 5月, 2011 1 次提交
  28. 21 5月, 2011 1 次提交
    • L
      sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds 提交于
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  29. 19 5月, 2011 1 次提交
    • D
      ipv4: Kill RT_CACHE_DEBUG · 6882f933
      David S. Miller 提交于
      It's way past it's usefulness.  And this gets rid of a bunch
      of stray ->rt_{dst,src} references.
      
      Even the comment documenting the macro was inaccurate (stated
      default was 1 when it's 0).
      
      If reintroduced, it should be done properly, with dynamic debug
      facilities.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6882f933
  30. 29 4月, 2011 2 次提交
  31. 18 2月, 2011 1 次提交
  32. 29 1月, 2011 1 次提交
  33. 27 1月, 2011 1 次提交
    • D
      net: Implement read-only protection and COW'ing of metrics. · 62fa8a84
      David S. Miller 提交于
      Routing metrics are now copy-on-write.
      
      Initially a route entry points it's metrics at a read-only location.
      If a routing table entry exists, it will point there.  Else it will
      point at the all zero metric place-holder called 'dst_default_metrics'.
      
      The writeability state of the metrics is stored in the low bits of the
      metrics pointer, we have two bits left to spare if we want to store
      more states.
      
      For the initial implementation, COW is implemented simply via kmalloc.
      However future enhancements will change this to place the writable
      metrics somewhere else, in order to increase sharing.  Very likely
      this "somewhere else" will be the inetpeer cache.
      
      Note also that this means that metrics updates may transiently fail
      if we cannot COW the metrics successfully.
      
      But even by itself, this patch should decrease memory usage and
      increase cache locality especially for routing workloads.  In those
      cases the read-only metric copies stay in place and never get written
      to.
      
      TCP workloads where metrics get updated, and those rare cases where
      PMTU triggers occur, will take a very slight performance hit.  But
      that hit will be alleviated when the long-term writable metrics
      move to a more sharable location.
      
      Since the metrics storage went from a u32 array of RTAX_MAX entries to
      what is essentially a pointer, some retooling of the dst_entry layout
      was necessary.
      
      Most importantly, we need to preserve the alignment of the reference
      count so that it doesn't share cache lines with the read-mostly state,
      as per Eric Dumazet's alignment assertion checks.
      
      The only non-trivial bit here is the move of the 'flags' member into
      the writeable cacheline.  This is OK since we are always accessing the
      flags around the same moment when we made a modification to the
      reference count.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      62fa8a84
  34. 10 11月, 2010 1 次提交
  35. 20 10月, 2010 1 次提交
    • E
      net: avoid RCU for NOCACHE dst · 27b75c95
      Eric Dumazet 提交于
      There is no point using RCU for dst we allocate for a very short time
      (used once).
      
      Change dst_release() to take DST_NOCACHE into account, but also change
      skb_dst_set_noref() to force a refcount increment for such dst.
      
      This is a _huge_ gain, because we dont waste memory to store xx thousand
      of dsts. Instead of queueing them to RCU, we can free them instantly.
      
      CPU caches can stay hot, re-using same memory blocks to hold temporary
      dsts.
      
      Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
      since atomic_dec_return() implies a full memory barrier.
      
      Stress test, 160.000.000 udp frames sent, IP route cache disabled
      (DDOS).
      
      Before:
      
      real    0m38.091s
      user    0m13.189s
      sys     7m53.018s
      
      After:
      
      real	0m29.946s
      user	0m12.157s
      sys	7m40.605s
      
      For reference, if IP route cache was enabled :
      
      real	0m32.030s
      user	0m10.521s
      sys	8m15.243s
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27b75c95
  36. 12 10月, 2010 2 次提交
    • E
      net dst: use a percpu_counter to track entries · fc66f95c
      Eric Dumazet 提交于
      struct dst_ops tracks number of allocated dst in an atomic_t field,
      subject to high cache line contention in stress workload.
      
      Switch to a percpu_counter, to reduce number of time we need to dirty a
      central location. Place it on a separate cache line to avoid dirtying
      read only fields.
      
      Stress test :
      
      (Sending 160.000.000 UDP frames,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_TRIE, SLUB/NUMA)
      
      Before:
      
      real    0m51.179s
      user    0m15.329s
      sys     10m15.942s
      
      After:
      
      real	0m45.570s
      user	0m15.525s
      sys	9m56.669s
      
      With a small reordering of struct neighbour fields, subject of a
      following patch, (to separate refcnt from other read mostly fields)
      
      real	0m41.841s
      user	0m15.261s
      sys	8m45.949s
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc66f95c
    • E
      neigh: speedup neigh_hh_init() · 34d101dd
      Eric Dumazet 提交于
      When a new dst is used to send a frame, neigh_resolve_output() tries to
      associate an struct hh_cache to this dst, calling neigh_hh_init() with
      the neigh rwlock write locked.
      
      Most of the time, hh_cache is already known and linked into neighbour,
      so we find it and increment its refcount.
      
      This patch changes the logic so that we call neigh_hh_init() with
      neighbour lock read locked only, so that fast path can be run in
      parallel by concurrent cpus.
      
      This brings part of the speedup we got with commit c7d4426a
      (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
      removing one of the contention point that routers hit on multiqueue
      enabled machines.
      
      Further improvements would need to use a seqlock instead of an rwlock to
      protect neigh->ha[], to not dirty neigh too often and remove two atomic
      ops.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34d101dd
  37. 21 7月, 2010 1 次提交