1. 22 Dec 2012, 1 commit
  2. 12 Dec 2012, 1 commit
    • pkt_sched: avoid requeues if possible · 1abbe139
      Committed by Eric Dumazet
      With BQL deployed, we are more likely to see the following behavior:
      
      We dequeue a packet from the qdisc in dequeue_skb(), then realize in
      sch_direct_xmit() that the target tx queue is in XOFF state, and we
      have to hold the skb in gso_skb for later.
      
      This shows in stats (tc -s qdisc dev eth0) as requeues.
      
      The problem with these requeues is that high-priority packets cannot
      be dequeued as long as this (possibly low-priority and big TSO)
      packet has not been removed from gso_skb.
      
      At 1 Gbps, a full-size TSO packet adds about 500 us of extra latency.
      
      In some cases, we know that all packets dequeued from a qdisc are
      destined for one particular, known txq:
      
      - If device is non multi queue
      - For all MQ/MQPRIO slave qdiscs
      
      This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE, to mark
      this capability, so that dequeue_skb() is allowed to dequeue a packet
      only if the associated txq is not stopped.
      
      This indeed reduces latencies for high-priority packets (or improves
      fairness with sfq/fq_codel), and almost eliminates qdisc 'requeues'.
      (A simplified sketch of the dequeue gating follows this entry.)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1abbe139
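      A minimal user-space sketch of the gating idea described above. The
      types and the flag value are stand-ins for illustration, not the
      kernel definitions; the real check lives in dequeue_skb():

          #include <stdbool.h>
          #include <stddef.h>

          #define TCQ_F_ONETXQUEUE 0x10       /* illustrative value only */

          struct txq   { bool stopped; };     /* stand-in for netdev_queue */
          struct skb;                         /* opaque packet             */
          struct qdisc {
              unsigned int flags;
              struct txq  *txq;               /* the single backing txq    */
              struct skb *(*dequeue)(struct qdisc *q);
          };

          /* Dequeue a packet only if we know in advance it can be sent:
           * for single-txq qdiscs, leave the packet queued while the target
           * txq is stopped (XOFF), so it never has to be requeued and does
           * not block higher-priority traffic behind it. */
          struct skb *dequeue_skb_model(struct qdisc *q)
          {
              if ((q->flags & TCQ_F_ONETXQUEUE) && q->txq->stopped)
                  return NULL;                /* keep it in the qdisc */
              return q->dequeue(q);
          }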
  3. 29 Nov 2012, 1 commit
  4. 26 Nov 2012, 1 commit
  5. 22 Nov 2012, 1 commit
  6. 20 Nov 2012, 1 commit
  7. 19 Nov 2012, 1 commit
  8. 08 Nov 2012, 1 commit
  9. 07 Nov 2012, 1 commit
    • htb: fix two bugs · 196d97f6
      Committed by Eric Dumazet
      Commit 56b765b7 (htb: improved accuracy at high rates)
      introduced two bugs:
      
      1) one bstats_update() was inadvertently removed from
         htb_dequeue_tree(), breaking statistics/rate estimation.
      
      2) Missing qdisc_put_rtab() calls in htb_change_class()
         leaked kernel memory, now that struct htb_class no longer
         retains pointers to qdisc_rate_table structs.
      
         Since only the rate is used, don't use qdisc_get_rtab() calls
         that copy data we ignore anyway.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vimalkumar <j.vimal@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      196d97f6
  10. 04 Nov 2012, 1 commit
    • htb: improved accuracy at high rates · 56b765b7
      Committed by Vimalkumar
      The current HTB (and TBF) uses a rate table computed by the "tc"
      userspace program, which has the following issue:
      
      The rate table has 256 entries that map packet lengths
      to tokens (time units).  With TSO-sized packets, the
      256-entry granularity leads to loss/gain of rate,
      making the token bucket inaccurate.
      
      Thus, instead of relying on the rate table, this patch
      explicitly computes and accounts for packet
      transmission times with nanosecond granularity.
      
      This greatly improves accuracy of HTB with a wide
      range of packet sizes.
      
      Example:
      
      tc qdisc add dev $dev root handle 1: \
              htb default 1
      
      tc class add dev $dev classid 1:1 parent 1: \
              rate 5Gbit mtu 64k
      
      Here is an example of inaccuracy:
      
      $ iperf -c host -t 10 -i 1
      
      With old htb:
      eth4:   34.76 Mb/s In  5827.98 Mb/s Out -  65836.0 p/s In  481273.0 p/s Out
      [SUM]  9.0-10.0 sec   669 MBytes  5.61 Gbits/sec
      [SUM]  0.0-10.0 sec  6.50 GBytes  5.58 Gbits/sec
      
      With new htb:
      eth4:   28.36 Mb/s In  5208.06 Mb/s Out -  53704.0 p/s In  430076.0 p/s Out
      [SUM]  9.0-10.0 sec   594 MBytes  4.98 Gbits/sec
      [SUM]  0.0-10.0 sec  5.80 GBytes  4.98 Gbits/sec
      
      The bit rate on the wire is still 5200 Mb/s with the new HTB
      because the qdisc accounts for packet length using skb->len, which
      is smaller than the total bytes on the wire if GSO is used.  But
      that is for another patch, regardless of how time is accounted.
      (A sketch contrasting the table lookup with the direct nanosecond
      computation follows this entry.)
      
      Many thanks to Eric Dumazet for review and feedback.
      Signed-off-by: Vimalkumar <j.vimal@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56b765b7
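      An illustrative comparison of the two approaches: a 256-entry table
      that buckets packet lengths versus a direct nanosecond computation.
      The table construction and cell_log value below are simplified
      assumptions, not tc's exact math:

          #include <stdint.h>
          #include <stdio.h>

          #define NSEC_PER_SEC 1000000000ULL

          /* Direct computation: time (ns) to send 'len' bytes at 'rate_bps'. */
          uint64_t xmit_time_ns(uint32_t len, uint64_t rate_bps)
          {
              return (uint64_t)len * 8 * NSEC_PER_SEC / rate_bps;
          }

          /* Table lookup: lengths are bucketed by (len >> cell_log), so
           * nearby TSO-sized lengths are charged the same token cost. */
          uint64_t xmit_time_table(uint32_t len, const uint64_t table[256],
                                   unsigned int cell_log)
          {
              unsigned int slot = len >> cell_log;

              return table[slot > 255 ? 255 : slot];
          }

          int main(void)
          {
              uint64_t rate = 5ULL * 1000 * 1000 * 1000;  /* 5 Gbit/s       */
              unsigned int cell_log = 8;                  /* 256-byte cells */
              uint64_t table[256];

              for (int i = 0; i < 256; i++)
                  table[i] = xmit_time_ns((uint32_t)(i + 1) << cell_log, rate);

              for (uint32_t len = 65226; len <= 65535; len += 103)
                  printf("len %5u: table %6llu ns  exact %6llu ns\n", len,
                         (unsigned long long)xmit_time_table(len, table, cell_log),
                         (unsigned long long)xmit_time_ns(len, rate));
              return 0;
          }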
  11. 26 Oct 2012, 1 commit
    • cgroup: net_cls: Rework update socket logic · 6a328d8c
      Committed by Daniel Wagner
      The cgroup logic in net_cls is very similar to the one in
      net_prio. Let's streamline the net_cls logic with the net_prio one.
      
      The net_prio update logic was changed by the following commit (note
      that some further changes were necessary later on):
      
      commit 406a3c63
      Author: John Fastabend <john.r.fastabend@intel.com>
      Date:   Fri Jul 20 10:39:25 2012 +0000
      
          net: netprio_cgroup: rework update socket logic
      
          Instead of updating the sk_cgrp_prioidx struct field on every send
          this only updates the field when a task is moved via cgroup
          infrastructure.
      
          This allows sockets that may be used by a kernel worker thread
          to be managed. For example in the iscsi case today a user can
          put iscsid in a netprio cgroup and control traffic will be sent
          with the correct sk_cgrp_prioidx value set but as soon as data
          is sent the kernel worker thread issues a send and sk_cgrp_prioidx
          is updated with the kernel worker thread's value, which is the
          default case.
      
          It seems more correct to only update the field when the user
          explicitly sets it via control group infrastructure. This allows
          the users to manage sockets that may be used with other threads.
      
      Since the classid is now updated when the task is moved between
      cgroups, we don't have to call sock_update_classid() from various
      places to ensure we are always using the latest classid value. (A
      simplified model of the update-on-move idea follows this entry.)
      
      [v2: Use iterate_fd() instead of open coding]
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <netdev@vger.kernel.org>
      Cc: <cgroups@vger.kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6a328d8c
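      A simplified user-space model of the update-on-move idea described
      above; all types here are stand-ins, not kernel structures, and the
      real kernel path walks the task's files with iterate_fd():

          #include <stddef.h>
          #include <stdint.h>

          struct sock_model { uint32_t classid; };

          struct task_model {
              struct sock_model *socks[16];   /* open sockets of the task */
              size_t             nr_socks;
              uint32_t           cgroup_classid;
          };

          /* Called from the cgroup attach path, not from the send path. */
          void task_moved_to_cgroup(struct task_model *t, uint32_t classid)
          {
              t->cgroup_classid = classid;
              for (size_t i = 0; i < t->nr_socks; i++)
                  t->socks[i]->classid = classid;
          }

          /* The send path only reads the value; it no longer rewrites it,
           * so sockets handed to kernel worker threads keep the classid
           * their owning task was given. */
          uint32_t classid_for_send(const struct sock_model *sk)
          {
              return sk->classid;
          }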
  12. 22 Oct 2012, 1 commit
  13. 28 Sep 2012, 1 commit
    • pkt_sched: Fix warning false positives. · f54ba779
      Committed by David S. Miller
      GCC refuses to recognize that all error control flows do in fact
      set err to something.
      
      Add an explicit initialization to shut it up. (A contrived
      illustration of the pattern follows this entry.)
      
      net/sched/sch_drr.c: In function ‘drr_enqueue’:
      net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
      net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
      net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f54ba779
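      A contrived illustration of the pattern, not the sch_drr/sch_qfq
      code: every reachable path assigns err, but the flow is indirect
      enough that gcc's -Wmaybe-uninitialized analysis may not prove it,
      and an explicit initializer silences the warning without changing
      behavior:

          int classify(int x)
          {
              return x < 0 ? -1 : (x > 0 ? 1 : 0);
          }

          int enqueue_example(int x)
          {
              int err = 0;        /* explicit init to quiet the warning */
              int cls = classify(x);

              if (cls < 0)
                  err = -22;      /* e.g. -EINVAL  */
              else if (cls > 0)
                  err = -105;     /* e.g. -ENOBUFS */
              else if (cls == 0)
                  err = 0;        /* gcc may not see these arms as exhaustive */
              return err;
          }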
  14. 25 Sep 2012, 1 commit
    • net: use a per task frag allocator · 5640f768
      Committed by Eric Dumazet
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      This is done to increase the probability of coalescing small write()s
      into single segments in skbs still in the write queue (not yet sent).
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
      
      It is also quite inefficient for building 64KB TSO packets, because
      we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so
      we hit the page allocator more often than we would like.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, that's order-3 pages on x86)
      
      This increases TCP stream performance by 20% on the loopback device,
      but also benefits other network devices, since 8x fewer frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with an IOMMU enabled.
      
      It is possible that some SG-enabled hardware cannot cope with bigger
      fragments, but its ndo_start_xmit() should already handle this,
      splitting a fragment into sub-fragments, since some arches have
      PAGE_SIZE = 65536. (A simplified user-space model of the allocator
      follows this entry.)
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5640f768
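      A simplified user-space model of a per-task fragment allocator with
      the fallback behavior described above. The chunk sizes and the use of
      malloc()/free() are illustrative only; the kernel version works on
      pages and reference counts them instead of freeing the old chunk:

          #include <stdlib.h>
          #include <stddef.h>

          #define FRAG_CHUNK_MAX 32768u   /* "order-3" chunk on 4 KB pages */
          #define FRAG_CHUNK_MIN  4096u   /* fallback: a single page       */

          struct task_frag {
              unsigned char *chunk;
              size_t         size;        /* size of the current chunk     */
              size_t         offset;      /* next free byte in the chunk   */
          };

          void *task_frag_alloc(struct task_frag *tf, size_t len)
          {
              if (len > FRAG_CHUNK_MIN)
                  return NULL;            /* not handled by this model */

              if (!tf->chunk || tf->offset + len > tf->size) {
                  size_t want = FRAG_CHUNK_MAX;
                  unsigned char *p = malloc(want);

                  if (!p) {               /* automatic fallback under pressure */
                      want = FRAG_CHUNK_MIN;
                      p = malloc(want);
                      if (!p)
                          return NULL;
                  }
                  free(tf->chunk);        /* model only: real code refcounts pages */
                  tf->chunk  = p;
                  tf->size   = want;
                  tf->offset = 0;
              }

              void *frag = tf->chunk + tf->offset;
              tf->offset += len;
              return frag;
          }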
  15. 20 Sep 2012, 1 commit
    • pkt_sched: fix virtual-start-time update in QFQ · 71261956
      Committed by Paolo Valente
      If the old timestamps of a class, say cl, are stale when the class
      becomes active, then QFQ may assign to cl a much higher start time
      than the maximum value allowed. This may happen when QFQ assigns to
      the start time of cl the finish time of a group whose classes are
      characterized by a higher value of the ratio
      max_class_pkt/weight_of_the_class than that of cl. Inserting a class
      with too high a start time into the bucket list corrupts the data
      structure and may eventually lead to crashes.
      This patch limits the maximum start time assigned to a class. (A
      sketch of the clamping idea follows this entry.)
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71261956
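      A sketch of the clamping idea only; the variable names and the bound
      used here are stand-ins, not the exact formula in sch_qfq.c:

          #include <stdint.h>

          /* When activating a class with stale timestamps, bound the virtual
           * start time S relative to the system virtual time V so the class
           * can be inserted into the bucket list safely (weight must be > 0). */
          uint64_t clamp_start_time(uint64_t proposed_S, uint64_t V,
                                    uint64_t max_pkt_len, uint64_t weight)
          {
              uint64_t limit = V + max_pkt_len / weight;   /* assumed bound */

              return proposed_S < limit ? proposed_S : limit;
          }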
  16. 15 Sep 2012, 2 commits
    • cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Committed by Tejun Heo
      Currently, cgroup hierarchy support is a mess.  The cpu-related
      subsystems behave correctly - configuration, accounting and control
      on a parent properly cover its children.  blkio and freezer
      completely ignore the hierarchy and treat all cgroups as if they were
      directly under the root cgroup.  Others show yet different behaviors.
      
      These differing interpretations of the cgroup hierarchy make using
      cgroups confusing and make it impossible to co-mount controllers into
      the same hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users relying on separate hierarchies
      and expecting completely different behaviors depending on the mounted
      subsystem is detrimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on. (A simplified model of the flag-and-warn-once mechanism
      follows this entry.)
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
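      A simplified model of the flag-and-warn-once mechanism described
      above; names are stand-ins, not the cgroup core's structures:

          #include <stdbool.h>
          #include <stdio.h>

          struct subsys_model {
              const char *name;
              bool        broken_hierarchy;
              bool        warned_broken_hierarchy;
          };

          /* Called when a new cgroup is created for this controller. */
          void warn_if_nested(struct subsys_model *ss, bool parent_is_root)
          {
              if (ss->broken_hierarchy && !parent_is_root &&
                  !ss->warned_broken_hierarchy) {
                  fprintf(stderr,
                          "cgroup: %s does not fully support hierarchy; "
                          "nested cgroups may misbehave\n", ss->name);
                  ss->warned_broken_hierarchy = true;  /* whine only once */
              }
          }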
    • cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Committed by Daniel Wagner
      WARNING: With this change it is impossible to load externally built
      controllers anymore.
      
      In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m
      are set, the corresponding subsys_id should also be a constant. Up to
      now, net_prio_subsys_id and net_cls_subsys_id were of type int and
      their values were assigned at runtime.
      
      By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_id will have constant values. That means
      we need to remove all the code which assumes a value can be assigned
      to net_prio_subsys_id and net_cls_subsys_id.
      
      A close look is necessary at the RCU part, which was introduced by
      the following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        This code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      Without an explicit explanation of why the RCU part is needed. (The
      rcu_dereference was fixed by exchanging it for
      rcu_dereference_index_check() in a later commit, but that is a minor
      detail.)
      
      So here is my reasoning about why it was introduced and why it is
      safe to remove it now. Note that this code was copied over to
      net_prio, so the reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing -1 as the id into task_subsys_state()
      returns an invalid value (below the lower bound).
      
      So this code makes sure that the id is assigned only after the module
      is loaded and the subsystem registered.
      
      Before unregistering the module, all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing
      a synchronize_rcu(). Any new readers won't call task_subsys_state()
      anymore, and therefore it is safe to unregister the subsystem.
      
      The new code relies on the same trick, but it looks at the subsys
      pointer returned by task_subsys_state() (remember the id is constant
      and therefore we always have a valid index into the subsys
      array).
      
      No precautions need to be taken during module loading. Eventually,
      all CPUs will get a valid pointer back from task_subsys_state(),
      because rebind_subsystem(), which is called after the module's init()
      function, will assign the newly loaded module's subsystem pointer to
      subsys[net_cls_subsys_id].
      
      When the subsystem is about to be removed, rebind_subsystem() will be
      called before the module's exit() function. In this case,
      rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
      and then call synchronize_rcu(). All old readers will have left the
      critical section by then. Any new reader won't access the subsystem
      anymore.  At this point we are safe to unregister the subsystem. No
      additional synchronize_rcu() call is needed. (A simplified user-space
      model of this pointer-check scheme follows this entry.)
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8a8e04df
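      A simplified user-space model of the pointer-check scheme described
      above: the subsystem id is a compile-time constant index, so readers
      always index the array safely and only need to check the pointer for
      NULL. The C11 atomics stand in for the RCU publish/read; all names
      are illustrative:

          #include <stdatomic.h>
          #include <stdint.h>

          enum { NET_CLS_SUBSYS_ID = 3, NR_SUBSYS = 8 };  /* constant at build time */

          struct css_model { uint32_t classid; };

          static _Atomic(struct css_model *) subsys[NR_SUBSYS];

          /* Module load: publish the pointer after the state is initialized. */
          void register_net_cls(struct css_model *state)
          {
              atomic_store_explicit(&subsys[NET_CLS_SUBSYS_ID], state,
                                    memory_order_release);
          }

          /* Reader: constant index; a NULL check replaces the old id >= 0 check. */
          uint32_t task_cls_classid_model(void)
          {
              struct css_model *css =
                  atomic_load_explicit(&subsys[NET_CLS_SUBSYS_ID],
                                       memory_order_acquire);
              return css ? css->classid : 0;
          }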
  17. 14 Sep 2012, 4 commits
  18. 12 Sep 2012, 1 commit
  19. 11 Sep 2012, 1 commit
  20. 06 Sep 2012, 1 commit
    • net: qdisc busylock needs lockdep annotations · 23d3b8bf
      Committed by Eric Dumazet
      It seems we need to provide the ability for stacked devices
      to use a specific lock_class_key for sch->busylock.
      
      We could instead default the l2tpeth tx_queue_len to 0 (no qdisc),
      but a user might use a qdisc anyway.
      
      (So the same fixes are probably needed on non-LLTX stacked drivers.)
      
      Noticed while stressing an L2TPv3 setup:
      
      ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.6.0-rc3+ #788 Not tainted
       -------------------------------------------------------
       netperf/4660 is trying to acquire lock:
        (l2tpsock){+.-...}, at: [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
      
       but task is already holding lock:
        (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&(&sch->busylock)->rlock){+.-...}:
              [<ffffffff810a5df0>] lock_acquire+0x90/0x200
              [<ffffffff817499fc>] _raw_spin_lock_irqsave+0x4c/0x60
              [<ffffffff81074872>] __wake_up+0x32/0x70
              [<ffffffff8136d39e>] tty_wakeup+0x3e/0x80
              [<ffffffff81378fb3>] pty_write+0x73/0x80
              [<ffffffff8136cb4c>] tty_put_char+0x3c/0x40
              [<ffffffff813722b2>] process_echoes+0x142/0x330
              [<ffffffff813742ab>] n_tty_receive_buf+0x8fb/0x1230
              [<ffffffff813777b2>] flush_to_ldisc+0x142/0x1c0
              [<ffffffff81062818>] process_one_work+0x198/0x760
              [<ffffffff81063236>] worker_thread+0x186/0x4b0
              [<ffffffff810694d3>] kthread+0x93/0xa0
              [<ffffffff81753e24>] kernel_thread_helper+0x4/0x10
      
       -> #0 (l2tpsock){+.-...}:
              [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10
              [<ffffffff810a5df0>] lock_acquire+0x90/0x200
              [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50
              [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
              [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
              [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70
              [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290
              [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00
              [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890
              [<ffffffff815db019>] ip_output+0x59/0xf0
              [<ffffffff815da36d>] ip_local_out+0x2d/0xa0
              [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680
              [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60
              [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30
              [<ffffffff815f5300>] tcp_push_one+0x30/0x40
              [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040
              [<ffffffff81614495>] inet_sendmsg+0x125/0x230
              [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0
              [<ffffffff81579ece>] sys_sendto+0xfe/0x130
              [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&(&sch->busylock)->rlock);
                                      lock(l2tpsock);
                                      lock(&(&sch->busylock)->rlock);
         lock(l2tpsock);
      
        *** DEADLOCK ***
      
       5 locks held by netperf/4660:
        #0:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e581c>] tcp_sendmsg+0x2c/0x1040
        #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff815da3e0>] ip_queue_xmit+0x0/0x680
        #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff815d9ac5>] ip_finish_output+0x135/0x890
        #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81595820>] dev_queue_xmit+0x0/0xe00
        #4:  (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00
      
       stack backtrace:
       Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
       Call Trace:
        [<ffffffff8173dbf8>] print_circular_bug+0x1fb/0x20c
        [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10
        [<ffffffff810a334b>] ? check_usage+0x9b/0x4d0
        [<ffffffff810a3f44>] ? __lock_acquire+0x2e4/0x1b10
        [<ffffffff810a5df0>] lock_acquire+0x90/0x200
        [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
        [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50
        [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
        [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
        [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
        [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70
        [<ffffffff81594e0e>] ? dev_hard_start_xmit+0x5e/0xa70
        [<ffffffff81595961>] ? dev_queue_xmit+0x141/0xe00
        [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290
        [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00
        [<ffffffff81595820>] ? dev_hard_start_xmit+0xa70/0xa70
        [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890
        [<ffffffff815d9ac5>] ? ip_finish_output+0x135/0x890
        [<ffffffff815db019>] ip_output+0x59/0xf0
        [<ffffffff815da36d>] ip_local_out+0x2d/0xa0
        [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680
        [<ffffffff815da3e0>] ? ip_local_out+0xa0/0xa0
        [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60
        [<ffffffff815fa25e>] ? tcp_md5_do_lookup+0x18e/0x1a0
        [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30
        [<ffffffff815f5300>] tcp_push_one+0x30/0x40
        [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040
        [<ffffffff81614495>] inet_sendmsg+0x125/0x230
        [<ffffffff81614370>] ? inet_create+0x6b0/0x6b0
        [<ffffffff8157e6e2>] ? sock_update_classid+0xc2/0x3b0
        [<ffffffff8157e750>] ? sock_update_classid+0x130/0x3b0
        [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0
        [<ffffffff81162579>] ? fget_light+0x3f9/0x4f0
        [<ffffffff81579ece>] sys_sendto+0xfe/0x130
        [<ffffffff810a69ad>] ? trace_hardirqs_on+0xd/0x10
        [<ffffffff8174a0b0>] ? _raw_spin_unlock_irq+0x30/0x50
        [<ffffffff810757e3>] ? finish_task_switch+0x83/0xf0
        [<ffffffff810757a6>] ? finish_task_switch+0x46/0xf0
        [<ffffffff81752cb7>] ? sysret_check+0x1b/0x56
        [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23d3b8bf
  21. 04 Sep 2012, 1 commit
  22. 17 Aug 2012, 1 commit
    • act_mirred: do not drop packets when fails to mirror it · 16c0b164
      Committed by Jason Wang
      We drop the packet unconditionally when we fail to mirror it. This is
      not intended in some cases. Consider a kvm guest: we may mirror the
      traffic of the bridge to a tap device used by a VM. When the kernel
      fails to mirror the packet, in conditions such as when qemu crashes
      or stops polling the tap, it is hard for the management software to
      detect such a condition and clean up the mirroring beforehand. This
      would cause all packets to the bridge to be dropped and break the
      network of the other virtual machines.
      
      To solve the issue, this patch does not drop packets when the kernel
      fails to mirror them; only failed redirects are dropped. (A
      simplified model of the decision follows this entry.)
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16c0b164
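      A simplified model of the new drop policy; the result codes mirror
      the mirror/redirect distinction conceptually, but this is not the
      kernel's tcf_mirred():

          #include <stdbool.h>

          enum act_result { ACT_PIPE, ACT_STOLEN, ACT_SHOT };

          enum act_result mirred_result(bool is_redirect, bool xmit_failed)
          {
              if (!xmit_failed)
                  return is_redirect ? ACT_STOLEN : ACT_PIPE;

              /* Mirroring is best effort: if the copy cannot be delivered
               * (e.g. the tap is no longer polled), let the original packet
               * continue instead of dropping it. Only a failed redirect,
               * where the original was meant to leave this path anyway,
               * results in a drop. */
              return is_redirect ? ACT_SHOT : ACT_PIPE;
          }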
  23. 15 Aug 2012, 3 commits
    • userns: Convert cls_flow to work with user namespaces enabled · a6c6796c
      Committed by Eric W. Biederman
      The flow classifier can use the uids and gids of the sockets that
      are transmitting packets and insert those uids and gids
      into the packet classification calculation.  I don't fully
      understand the details, but it appears that we can depend
      on specific uids and gids when making traffic classification
      decisions.
      
      To work with user namespaces enabled, map kuids and kgids
      into uids and gids in the initial user namespace, giving raw
      integer values the code can work with and depend on.
      
      To avoid userspace depending on uids and gids in
      packet classifiers installed from other user namespaces
      and getting confused, deny all packet classifiers that
      use uids or gids and do not come from a netlink socket
      in the initial user namespace.
      
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Changli Gao <xiaosuo@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      a6c6796c
    • net sched: Pass the skb into change so it can access NETLINK_CB · af4c6641
      Committed by Eric W. Biederman
      cls_flow.c plays with uids and gids.  Unless I misread that
      code, it is possible for classifiers to depend on specific uid and
      gid values.  Therefore I need to know the user namespace of the
      netlink socket that is installing the packet classifiers.  Pass
      in the rtnetlink skb so I can access the NETLINK_CB of the passed
      packet.  In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).
      
      Pass in not the user namespace but the incoming rtnetlink skb into
      the classifier change routines, as that is generally the more useful
      parameter.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      af4c6641
    • net: move and rename netif_notify_peers() · ee89bab1
      Committed by Amerigo Wang
      I believe net/core/dev.c is a better place for netif_notify_peers(),
      because the other net event notification functions also live in this
      file.
      
      And rename it to netdev_notify_peers().
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee89bab1
  24. 09 Aug 2012, 1 commit
    • sched: add missing group change to qfq_change_class · be72f63b
      Committed by Paolo Valente
      [Resending again, as the text was corrupted by the email client]
      
      To speed up operations, QFQ internally divides classes into
      groups. Which group a class belongs to depends on the ratio between
      the maximum packet length and the weight of the class. Unfortunately
      the function qfq_change_class lacks the steps for changing the group
      of a class when the ratio max_pkt_len/weight of the class changes.
      
      For example, when the last of the following three commands is
      executed, the group of class 1:1 is not correctly changed:
      
      tc qdisc add dev XXX root handle 1: qfq
      tc class add dev XXX parent 1: qfq classid 1:1 weight 1
      tc class change dev XXX parent 1: classid 1:1 qfq weight 4
      
      Not changing the group of a class does not affect the long-term
      bandwidth guaranteed to the class, as the latter is independent of
      the maximum packet length and changes correctly (only) if the weight
      of the class changes. In contrast, if the group of the class is not
      updated, the class is still guaranteed the short-term bandwidth and
      packet delay related to its old group, instead of the guarantees that
      it should receive according to its new weight and/or maximum packet
      length. This may also break service guarantees for other classes.
      This patch adds the missing operations. (A simplified illustration of
      the group computation follows this entry.)
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be72f63b
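      A simplified illustration of why the group can change: QFQ buckets
      classes roughly by the magnitude of max_pkt_len/weight, so changing
      the weight can move a class to a different group. The index formula
      below is a sketch, not the exact computation in sch_qfq.c (compile
      with -lm):

          #include <math.h>
          #include <stdio.h>

          int group_index(unsigned int max_pkt_len, unsigned int weight)
          {
              /* a larger max_pkt_len/weight ratio lands in a higher group */
              return (int)floor(log2((double)max_pkt_len / weight));
          }

          int main(void)
          {
              /* mirrors the example above: the weight changes from 1 to 4 */
              printf("weight 1 -> group %d\n", group_index(16384, 1));
              printf("weight 4 -> group %d\n", group_index(16384, 4));
              return 0;
          }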
  25. 07 Aug 2012, 1 commit
  26. 04 Aug 2012, 1 commit
  27. 24 Jul 2012, 1 commit
    • ipv4: Prepare for change of rt->rt_iif encoding. · 92101b3b
      Committed by David S. Miller
      Use inet_iif() consistently, and for TCP record the input interface of
      cached RX dst in inet sock.
      
      rt->rt_iif is going to be encoded differently, so that we can
      legitimately cache input routes in the FIB info more aggressively.
      
      When the input interface is "use SKB device index" the rt->rt_iif will
      be set to zero.
      
      This forces us to move the TCP RX dst cache installation into the
      ipv4-specific code, and rightly so, since doing the route caching for
      ipv6 is pointless at the moment because it is not inspected in the
      ipv6 input paths yet.
      
      Also, remove the unlikely() on dst->obsolete; all ipv4 dsts have
      obsolete set to a non-zero value to force invocation of the check
      callback.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92101b3b
  28. 17 Jul 2012, 1 commit
    • netem: refine early skb orphaning · 5a308f40
      Committed by Eric Dumazet
      netem does an early orphaning of skbs. Doing so breaks TCP Small
      Queues or any mechanism relying on socket sk_wmem_alloc feedback.
      
      Ideally, we should perform this orphaning after the rate module and
      before the delay module, to mimic what happens on a real link:
      
      skb orphaning is indeed normally done at TX completion, before the
      transit on the link.
      
      +-------+   +--------+  +---------------+  +-----------------+
      + Qdisc +---> Device +--> TX completion +--> links / hops    +->
      +       +   +  xmit  +  + skb orphaning +  + propagation     +
      +-------+   +--------+  +---------------+  +-----------------+
            < rate limiting >                  < delay, drops, reorders >
      
      If netem is used without the delay feature (only drops, reorders,
      rate limiting), then we should avoid early skb orphaning, to keep
      pressure on sockets as long as packets are still in the qdisc queue.
      
      Ideally, netem should be refactored to implement the delay module
      as the last stage. The current algorithm merges the two phases
      (rate limiting + delay), so it is not correct. (A simplified model of
      the orphaning condition follows this entry.)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Mark Gordon <msg@google.com>
      Cc: Andreas Terzis <aterzis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a308f40
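      A simplified model of the resulting policy: orphan the skb early only
      when a delay is actually configured, so sockets keep sk_wmem_alloc
      back-pressure while packets sit in the qdisc. Types and field names
      are stand-ins:

          struct netem_cfg { long latency_us; long jitter_us; };
          struct skb_model { void (*orphan)(struct skb_model *skb); };

          void maybe_orphan(struct skb_model *skb, const struct netem_cfg *q)
          {
              /* No delay configured (pure drop/reorder/rate setups): keep
               * the skb charged to its socket while it waits in the qdisc. */
              if (q->latency_us == 0 && q->jitter_us == 0)
                  return;
              skb->orphan(skb);
          }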
  29. 12 Jul 2012, 2 commits
  30. 09 Jul 2012, 1 commit
    • netem: add limitation to reordered packets · 960fb66e
      Committed by Eric Dumazet
      Fix two netem bugs:
      
      1) When a frame was dropped by tfifo_enqueue(), the drop counter
         was incremented twice.
      
      2) When reordering is triggered, we enqueue a packet without
         checking the queue limit. This can OOM pretty fast when repeated
         often enough; since the skbs are orphaned, no socket limit can
         help in this situation. (A simplified model of the bounded enqueue
         follows this entry.)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mark Gordon <msg@google.com>
      Cc: Andreas Terzis <aterzis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      960fb66e
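      A simplified model of the second fix: even a reordered (re-inserted)
      packet must respect the configured queue limit, and a dropped frame
      bumps the counter exactly once. Types and counters are stand-ins:

          #include <stdbool.h>

          struct tfifo_model {
              unsigned int  qlen;
              unsigned int  limit;
              unsigned long drops;
          };

          bool tfifo_enqueue_model(struct tfifo_model *q, bool reordered)
          {
              (void)reordered;    /* reordering no longer bypasses the check */
              if (q->qlen >= q->limit) {
                  q->drops++;     /* counted once per dropped frame */
                  return false;
              }
              q->qlen++;
              return true;
          }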
  31. 05 Jul 2012, 1 commit
  32. 04 Jul 2012, 1 commit
  33. 27 Jun 2012, 1 commit