1. 22 Oct 2013 (1 commit)
    • ipv6: sit: add GSO/TSO support · 61c1db7f
      Committed by Eric Dumazet
      Now that ipv6_gso_segment() is stackable, it's relatively easy to
      implement GSO/TSO support for SIT tunnels.
      
      Performance results, when segmentation is done after the tunnel
      device (as no NIC is yet enabled for TSO SIT support):
      
      Before patch :
      
      lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
      MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  16384  16384    10.00      3168.31   4.81     4.64     2.988   2.877
      
      After patch :
      
      lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
      MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  16384  16384    10.00      5525.00   7.76     5.17     2.763   1.840
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 20 Oct 2013 (5 commits)
    • net: switch net_secret key generation to net_get_random_once · e34c9a69
      Committed by Hannes Frederic Sowa
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce new macro net_get_random_once · a48e4292
      Committed by Hannes Frederic Sowa
      net_get_random_once is a new macro which handles the initialization
      of secret keys. It is possible to call it in the fast path: only the
      initialization depends on the spinlock and is rather slow. Otherwise
      it should be used just before the key is needed, to delay the entropy
      extraction as late as possible and get better randomness. It returns
      true if the key got initialized.
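      As an illustration, a hypothetical caller (modeled on how a hash
      function might use it; the names here are illustrative, not from this
      patch):

              #include <linux/jhash.h>
              #include <linux/net.h>

              static u32 hash_secret __read_mostly;

              static u32 compute_flow_hash(const void *data, u32 len)
              {
                      /* Lazily initialize the secret on first use; nearly
                       * free afterwards thanks to the static_key. */
                      net_get_random_once(&hash_secret, sizeof(hash_secret));
                      return jhash(data, len, hash_secret);
              }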
      
      The usage of static_keys for net_get_random_once is a bit uncommon, so
      it needs some further explanation of why this actually works:
      
      === The simple non-HAVE_JUMP_LABEL case ===
      Here we actually have no constraints on using static_key_(true|false)
      on keys initialized with STATIC_KEY_INIT_(FALSE|TRUE), so this path
      just expands in favor of the likely case that the initialization is
      already done. The key is initialized like this:
      
      ___done_key = { .enabled = ATOMIC_INIT(0) }
      
      The check
      
                      if (!static_key_true(&___done_key))                     \
      
      expands into (pseudo code)
      
                      if (!likely(___done_key > 0))
      
      so we take the fast path as soon as ___done_key is increased from the
      helper function.
      
      === The HAVE_JUMP_LABEL case ===
      If HAVE_JUMP_LABELs are available, this depends on the patching of
      jumps into the prepared NOPs, which is done in jump_label_init at
      boot-up time (from start_kernel). It is forbidden and dangerous to use
      net_get_random_once in functions which are called before that!
      
      At compilation time NOPs are generated at the call sites of
      net_get_random_once. E.g. net/ipv6/inet6_hashtables.c:inet6_ehashfn
      (we need to call net_get_random_once two times in inet6_ehashfn, so
      two NOPs):
      
            71:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
            76:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      
      Both will be patched to the actual jumps to the end of the function to
      call __net_get_random_once at boot time as explained above.
      
      arch_static_branch is optimized and inlined to return false, and it
      actually also returns false in case the NOP is placed in the
      instruction stream. So in the fast case we get a "return false". But
      because we initialize ___done_key with (enabled != (entries & 1)),
      this call site will get patched up at boot, thus returning true. The
      final check looks like this:
      
                      if (!static_key_true(&___done_key))                     \
                              ___ret = __net_get_random_once(buf,             \
      
      expands to
      
                      if (!!static_key_false(&___done_key))                     \
                              ___ret = __net_get_random_once(buf,             \
      
      So we get true at boot time, and as soon as static_key_slow_inc is
      called on the key, it will invert the logic and return false for the
      fast path. static_key_slow_inc will change the branch because the key
      got initialized with .enabled == 0. After static_key_slow_inc is
      called on the key, the branch is replaced with a NOP again.
      
      === Misc ===
      The helper defers the increment into a workqueue, so we don't have
      problems calling this code from atomic sections. A separate boolean
      (___done) guards the case where we enter net_get_random_once again
      before the increment happened.
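      For reference, a sketch of the macro consistent with the description
      above (reconstructed from this commit message, so details may differ
      from the final tree):

              #define net_get_random_once(buf, nbytes)                    \
                      ({                                                  \
                              bool ___ret = false;                        \
                              static bool ___done = false;                \
                              static struct static_key ___done_key =      \
                                      ___NET_RANDOM_STATIC_KEY_INIT;      \
                              if (!static_key_true(&___done_key))         \
                                      ___ret = __net_get_random_once(buf, \
                                                      nbytes,             \
                                                      &___done,           \
                                                      &___done_key);      \
                              ___ret;                                     \
                      })

      where ___NET_RANDOM_STATIC_KEY_INIT sets up the key with the
      (enabled != (entries & 1)) trick described above in the
      HAVE_JUMP_LABEL case, and expands to STATIC_KEY_INIT_FALSE otherwise.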
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipip: add GSO/TSO support · cb32f511
      Committed by Eric Dumazet
      Now that inet_gso_segment() is stackable, it's relatively easy to
      implement GSO/TSO support for IPIP.
      
      Performance results, when segmentation is done after the tunnel
      device (as no NIC is yet enabled for TSO IPIP support):
      
      Before patch :
      
      lpq83:~# ./netperf -H 7.7.9.84 -Cc
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  16384  16384    10.00      3357.88   5.09     3.70     2.983   2.167
      
      After patch :
      
      lpq83:~# ./netperf -H 7.7.9.84 -Cc
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  16384  16384    10.00      7710.19   4.52     6.62     1.152   1.687
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: gso: make inet_gso_segment() stackable · 3347c960
      Committed by Eric Dumazet
      In order to support GSO on IPIP, we need to make
      inet_gso_segment() stackable.
      
      It should not assume the network header starts right after the MAC
      header.
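      The essence, sketched (not the literal patch): derive the offset from
      the skb instead of assuming it is zero.

              /* Offset of the network header relative to the MAC header,
               * taken from the skb rather than assumed: */
              int nhoff = skb_network_header(skb) - skb_mac_header(skb);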
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: generalize skb_segment() · 030737bc
      Committed by Eric Dumazet
      While implementing GSO/TSO support for IPIP, I found skb_segment()
      was assuming the network header immediately followed the MAC header.
      
      That's not really true when inet_gso_segment() is stacked: by the time
      tcp_gso_segment() is called, the network header points to the inner
      IP header.
      
      Let's instead assume nothing and pick up the current offsets found in
      the original skb; we have the skb_headers_offset_update() helper for
      that.
      
      Also move the csum_start update inside skb_headers_offset_update().
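      A sketch of the helper with the csum_start move folded in, consistent
      with this description (field list abridged; the real helper also
      updates the inner header offsets):

              static void skb_headers_offset_update(struct sk_buff *skb,
                                                    int off)
              {
                      /* moved here by this patch: */
                      if (skb->ip_summed == CHECKSUM_PARTIAL)
                              skb->csum_start += off;
                      skb->transport_header += off;
                      skb->network_header += off;
                      if (skb_mac_header_was_set(skb))
                              skb->mac_header += off;
              }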
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 18 Oct 2013 (1 commit)
    • net: refactor sk_page_frag_refill() · 400dfd3a
      Committed by Eric Dumazet
      While working on a new allocation strategy for virtio_net to increase
      the payload/truesize ratio, we found that refactoring
      sk_page_frag_refill() was needed.
      
      This patch splits sk_page_frag_refill() into two parts, adding
      skb_page_frag_refill(), which can be used without a socket.
      
      While we are at it, add a minimum frag size of 32 for
      sk_page_frag_refill().
      
      Michael will either use netdev_alloc_frag() from softirq context,
      or skb_page_frag_refill() from process context in refill_work()
      (GFP_KERNEL allocations).
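      A sketch of the resulting split as described (reconstructed; the
      exact code may differ):

              /* Core refill, usable without a socket (e.g. by virtio_net): */
              bool skb_page_frag_refill(unsigned int sz,
                                        struct page_frag *pfrag, gfp_t gfp);

              /* The socket wrapper keeps the memory-pressure handling and
               * enforces the 32-byte minimum frag size: */
              bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
              {
                      if (skb_page_frag_refill(32U, pfrag, sk->sk_allocation))
                              return true;

                      sk_enter_memory_pressure(sk);
                      sk_stream_moderate_sndbuf(sk);
                      return false;
              }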
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Michael Dalton <mwdalton@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 10 Oct 2013 (1 commit)
    • net: gro: allow to build full sized skb · 8a29111c
      Committed by Eric Dumazet
      skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
      typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.
      
      It's relatively easy to extend the skb using frag_list to allow
      more frags to be appended into the last sk_buff.
      
      This still builds very efficient skbs, and allows reaching 45 MSS per
      skb.
      
      (A 45-MSS GRO packet uses one skb plus a frag_list containing 2
      additional sk_buffs.)
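      For context on the numbers, assuming a typical 1448-byte MSS (1500-byte
      MTU minus IP/TCP headers and timestamps): 17 * 1448 = 24616 bytes for
      the old ceiling, while 45 * 1448 = 65160 bytes fits just under the
      65536-byte skb limit.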
      
      High-speed TCP flows benefit from this extension through lower TCP
      stack CPU usage (fewer packets stored in the receive queue, fewer ACK
      packets processed).
      
      Forwarding setups could be hurt, as such skbs will need to be
      linearized, although it's not a new problem, as GRO could already
      provide skbs with a frag_list.
      
      We could make the 65536 bytes threshold a tunable to mitigate this.
      
      (The first time we need to linearize an skb in skb_needs_linearize(),
      we could lower the tunable to ~16*1460 so that following
      skb_gro_receive() calls build smaller skbs.)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 09 Oct 2013 (2 commits)
  6. 08 Oct 2013 (3 commits)
    • net: Separate the close_list and the unreg_list v2 · 5cde2829
      Committed by Eric W. Biederman
      Separate the unreg_list and the close_list in dev_close_many,
      preventing dev_close_many from permuting the unreg_list.  The
      permutations of the unreg_list have resulted in cases where the
      loopback device is accessed after it has been freed, in code such as
      dst_ifdown, resulting in subtle memory corruption.
      
      This is the second bug from sharing the storage between the close_list
      and the unreg_list.  The issues that crop up with sharing are
      apparently too subtle to show up in normal testing or usage, so let's
      forget about being clever and use two separate lists.
      
      v2: Make all callers pass in a close_list to dev_close_many
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5cde2829
    • net: fix unsafe set_memory_rw from softirq · d45ed4a4
      Committed by Alexei Starovoitov
      On an x86 system with net.core.bpf_jit_enable = 1,
      
      sudo tcpdump -i eth1 'tcp port 22'
      
      causes the warning:
      [   56.766097]  Possible unsafe locking scenario:
      [   56.766097]
      [   56.780146]        CPU0
      [   56.786807]        ----
      [   56.793188]   lock(&(&vb->lock)->rlock);
      [   56.799593]   <Interrupt>
      [   56.805889]     lock(&(&vb->lock)->rlock);
      [   56.812266]
      [   56.812266]  *** DEADLOCK ***
      [   56.812266]
      [   56.830670] 1 lock held by ksoftirqd/1/13:
      [   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
      [   56.849757]
      [   56.849757] stack backtrace:
      [   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
      [   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
      [   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
      [   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
      [   56.923006] Call Trace:
      [   56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
      [   56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
      [   56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
      [   56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
      [   56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
      [   56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
      [   56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
      [   56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
      [   56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
      [   56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
      [   57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
      [   57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
      [   57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
      [   57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
      [   57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
      [   57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
      [   57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
      [   57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
      [   57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
      [   57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
      [   57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
      [   57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70
      
      We cannot reuse the JITed filter memory, since it's read-only, so use
      the original BPF insns memory to hold the work_struct.
      
      Defer the kfree of sk_filter until the JIT has completed freeing.
      
      Tested on x86_64 and i386.
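      A sketch of the deferral on x86_64, consistent with the description
      above (reconstructed and simplified to a single page; it relies on the
      still-writable insns area of sk_filter doubling as a work_struct):

              static void bpf_jit_free_deferred(struct work_struct *work)
              {
                      struct sk_filter *fp;
                      unsigned long addr;

                      fp = container_of(work, struct sk_filter, work);
                      addr = (unsigned long)fp->bpf_func & PAGE_MASK;

                      /* Safe now: we are in process context. */
                      set_memory_rw(addr, 1);
                      module_free(NULL, (void *)addr);
                      kfree(fp);
              }

              void bpf_jit_free(struct sk_filter *fp)
              {
                      if (fp->bpf_func != sk_run_filter) {
                              INIT_WORK(&fp->work, bpf_jit_free_deferred);
                              schedule_work(&fp->work);
                      } else {
                              kfree(fp);
                      }
              }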
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netif_set_xps_queue: make cpu mask const · 3573540c
      Committed by Michael S. Tsirkin
      virtio wants to pass in cpumask_of(cpu); make the parameter
      const to avoid build warnings.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 04 Oct 2013 (1 commit)
  8. 01 Oct 2013 (3 commits)
  9. 29 Sep 2013 (3 commits)
    • net: introduce SO_MAX_PACING_RATE · 62748f32
      Committed by Eric Dumazet
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by the transport layer. The value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use the FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
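      A minimal usage sketch, assuming a kernel exposing SO_MAX_PACING_RATE
      and an fq qdisc on the egress device (tc qdisc add dev eth0 root fq):

              #include <stdio.h>
              #include <sys/socket.h>

              /* Cap the socket's pacing rate; returns 0 on success. */
              int set_max_pacing_rate(int sockfd, unsigned int bytes_per_sec)
              {
                      if (setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE,
                                     &bytes_per_sec,
                                     sizeof(bytes_per_sec)) < 0) {
                              perror("setsockopt(SO_MAX_PACING_RATE)");
                              return -1;
                      }
                      return 0;
              }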
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: net_secret should not depend on TCP · 9a3bab6b
      Committed by Eric Dumazet
      A host might need net_secret[] and never open a single socket.
      
      Problem added in commit aebda156
      ("net: defer net_secret[] initialization")
      
      Based on prior patch from Hannes Frederic Sowa.
      Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Delay default_device_exit_batch until no devices are unregistering v2 · 50624c93
      Committed by Eric W. Biederman
      There is currently no serialization between network namespaces exiting
      and network devices exiting, as the final part of netdev_run_todo does
      not happen under the rtnl_lock.  This is compounded by the fact that
      the only list of devices unregistering in netdev_run_todo is local to
      netdev_run_todo.
      
      This lack of serialization in extreme cases results in network devices
      unregistering in netdev_run_todo after the loopback device of their
      network namespace has been freed (making dst_ifdown unsafe), and after
      their network namespace has exited (making the NETDEV_UNREGISTER
      and NETDEV_UNREGISTER_FINAL callbacks unsafe).
      
      Add the missing serialization with a per network namespace count of
      how many network devices are unregistering and a wait queue that is
      woken up whenever the count is decreased.  The count and wait queue
      allow default_device_exit_batch to wait until all of the unregistration
      activity for a network namespace has finished before proceeding to
      unregister the loopback device and then allowing the network namespace
      to exit.
      
      Only a single global wait queue is used because there is a single
      global lock and a single waiter; per network namespace wait queues
      would be a waste of resources.
      
      The per network namespace count of unregistering devices gives a
      progress guarantee because the number of network devices unregistering
      in an exiting network namespace must ultimately drop to zero (assuming
      network device unregistration completes).
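      A simplified sketch of the waiter side as described (the field and
      wait-queue names here are assumptions, not quoted from the patch; the
      real helper, rtnl_lock_unregistering, is an expanded wait_event that
      also juggles the rtnl lock):

              static DECLARE_WAIT_QUEUE_HEAD(netdev_unregistering_wq);

              /* Proceed only once no device in any of the exiting
               * namespaces is still unregistering. */
              static bool unregistering_done(struct list_head *net_list)
              {
                      struct net *net;

                      list_for_each_entry(net, net_list, exit_list)
                              if (net->dev_unreg_count > 0)
                                      return false;
                      return true;
              }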
      
      The basic logic remains the same as in v1.  This patch is now half
      comment, and half rtnl_lock_unregistering(), an expanded version of
      wait_event that performs no extra work in the common case where no
      network devices are unregistering when we get to
      default_device_exit_batch.
      Reported-by: Francesco Ruggeri <fruggeri@aristanetworks.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 27 Sep 2013 (9 commits)
    • net: create sysfs symlinks for neighbour devices · 5831d66e
      Committed by Veaceslav Falico
      Also, remove the same functionality from bonding - it will already be
      done for any device that links to its lower/upper neighbour.
      
      The links will be created in the dev's kobject, and will look like
      lower_eth0 for the lower device eth0 and upper_bridge0 for the upper
      device bridge0.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: expose the master link to sysfs, and remove it from bond · 842d67a7
      Committed by Veaceslav Falico
      Currently, we can have only one master upper neighbour, so it would be
      useful to create a symlink to it in the sysfs device directory, the way
      that bonding now does it, for every device. Lower devices from
      bridge/team/etc will automagically get it, so we could rely on it.
      
      Also, remove the same functionality from bonding.
      
      CC: Jay Vosburgh <fubar@us.ibm.com>
      CC: Andy Gospodarek <andy@greyhouse.net>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add a possibility to get private from netdev_adjacent->list · b6ccba4c
      Committed by Veaceslav Falico
      It will be useful for getting the first/last element.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add for_each iterators through neighbour lower link's private · 31088a11
      Committed by Veaceslav Falico
      Add a possibility to iterate through netdev_adjacent's private,
      currently only for lower neighbours.
      
      Add both RCU and RTNL/other locking variants of iterators, and make
      the non-RCU variant safe from removal.
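      A hypothetical usage sketch of such an iterator (the macro name
      follows the commit summary's wording and is an assumption):

              struct slave *slave;
              struct list_head *iter;

              /* RTNL-protected walk over the lower neighbours' privates: */
              netdev_for_each_lower_private(bond_dev, slave, iter)
                      slave_do_something(slave);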
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add netdev_adjacent->private and allow to use it · 402dae96
      Committed by Veaceslav Falico
      Currently, even though we can access any linked device, we can't attach
      anything to it, which is vital to properly manage them.
      
      To fix this, add a new void *private to netdev_adjacent and functions
      setting/getting it (per link), so that we can save, for example,
      bonding's slave structures there, per slave device.
      
      netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
      upper dev and populates the neighbour link only with private.
      
      netdev_lower_dev_get_private{,_rcu}() returns the private, if found.
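      A hypothetical usage sketch based on the functions named above, as a
      bonding-style driver might use them (the argument order of the getter
      is an assumption):

              /* Link the slave under the bond, storing new_slave as the
               * per-link private: */
              err = netdev_master_upper_dev_link_private(slave_dev, bond_dev,
                                                         new_slave);
              if (err)
                      return err;

              /* Later, recover the private from the link (an _rcu variant
               * also exists): */
              struct slave *slave = netdev_lower_dev_get_private(bond_dev,
                                                                 slave_dev);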
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add RCU variant to search for netdev_adjacent link · 5249dec7
      Committed by Veaceslav Falico
      Currently we have only the RTNL flavour; however, we can traverse the
      list while holding only RCU, so add the RCU search. Add an RCU variant
      that uses list_head * as an argument, so that it can be universally
      used afterwards.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add adj_list to save only neighbours · 2f268f12
      Committed by Veaceslav Falico
      Currently, we distinguish neighbours (first-level linked devices) from
      non-neighbours by the neighbour bool in netdev_adjacent. This could be
      quite time-consuming when we would like to traverse *only* through
      neighbours - because we'd have to traverse through all devices and
      check for this flag, and in a (quite common) scenario where we have
      lots of vlans on top of a bridge, which is on top of a bond - the
      bonding would have to go through all those vlans to get to its upper
      neighbour linked devices.
      
      This situation is really unpleasant, because there are already a lot
      of cases when a device with slaves needs to go through them in the hot
      path.
      
      To fix this, introduce a new upper/lower device list structure -
      adj_list, which contains only the neighbours. It always works in
      pair with the all_adj_list structure (renamed from
      upper/lower_dev_list), i.e. both of them contain the same links, only
      that all_adj_list also contains non-neighbour device links. It's
      really a small change, visible, currently, only in
      __netdev_adjacent_dev_insert/remove(), and doesn't change the main
      linking logic at all.
      
      Also, add some comments and fix a name collision in
      netdev_for_each_upper_dev_rcu(), and rework the naming by the
      following rules:
      
      netdev_(all_)(upper|lower)_*
      
      If "all_" is present, then we work with the whole list of upper/lower
      devices, otherwise only with direct neighbours. Uninline the functions
      to get better stack traces.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use lists as arguments instead of bool upper · 7863c054
      Committed by Veaceslav Falico
      Currently we make use of a bool upper when we want to specify whether
      we want to work with the upper or lower list. It is, however, harder
      to read and debug, and occupies a lot more code.
      
      Fix this by just passing the correct upper/lower_dev_list list_head
      pointer instead of bool upper, and work internally with it.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: neighbour: use source address of last enqueued packet for solicitation · 4ed377e3
      Committed by Hannes Frederic Sowa
      Currently we always use the first member of the arp_queue to determine
      the sender ip address of the arp packet (or in case of IPv6 - source
      address of the ndisc packet). This skb is fixed as long as the queue is
      not drained by a complete purge because of a timeout or by a successful
      response.
      
      If the first packet enqueued on the arp_queue is from a local
      application with a manually set source address, and the system to be
      discovered does some kind of uRPF check on the source address in the
      arp packet, the resolving process hangs until a timeout and restarts.
      This hurts communication with the participating network node.
      
      This could be mitigated a bit if we use the latest enqueued skb's
      source address for the resolving process, which is not as static as
      the arp_queue's head. This change of the source address could result in
      better recovery of a failed solicitation.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Julian Anastasov <ja@ssi.bg>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 20 Sep 2013 (1 commit)
    • netpoll: fix NULL pointer dereference in netpoll_cleanup · d0fe8c88
      Committed by Nikolay Aleksandrov
      I've been hitting a NULL ptr deref while using netconsole because the
      np->dev check and the pointer manipulation in netpoll_cleanup are done
      without rtnl, and the following sequence happens when we have a
      netconsole over a vlan and remove the vlan while disabling the
      netconsole:
      	CPU 1					CPU2
      					removes vlan and calls the notifier
      enters store_enabled(), calls
      netdev_cleanup which checks np->dev
      and then waits for rtnl
      					executes the netconsole netdev
      					release notifier making np->dev
      					== NULL and releases rtnl
      continues to dereference a member of
      np->dev which at this point is == NULL
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 13 Sep 2013 (1 commit)
  13. 12 Sep 2013 (1 commit)
    • net: fix multiqueue selection · 50d1784e
      Committed by Eric Dumazet
      commit 416186fb ("net: Split core bits of netdev_pick_tx
      into __netdev_pick_tx") added a bug that disables caching of the
      queue index in the socket.
      
      This is the source of packet reorders for TCP flows, and again this
      happens more often when using FQ pacing.
      
      The old code did:
      
      if (queue_index != old_index)
      	sk_tx_queue_set(sk, queue_index);
      
      Alexander renamed the variables but forgot to change the second
      parameter of sk_tx_queue_set():
      
      if (queue_index != new_index)
      	sk_tx_queue_set(sk, queue_index);
      
      This means we store -1 over and over in sk->sk_tx_queue_mapping.
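      The fix, accordingly, is to pass the renamed variable:
      
      if (queue_index != new_index)
      	sk_tx_queue_set(sk, new_index);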
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 04 Sep 2013 (3 commits)
  15. 31 Aug 2013 (3 commits)
  16. 30 Aug 2013 (2 commits)