1. 23 3月, 2013 2 次提交
  2. 22 3月, 2013 7 次提交
    • E
      tcp: preserve ACK clocking in TSO · f4541d60
      Eric Dumazet 提交于
      A long standing problem with TSO is the fact that tcp_tso_should_defer()
      rearms the deferred timer, while it should not.
      
      Current code leads to following bad bursty behavior :
      
      20:11:24.484333 IP A > B: . 297161:316921(19760) ack 1 win 119
      20:11:24.484337 IP B > A: . ack 263721 win 1117
      20:11:24.485086 IP B > A: . ack 265241 win 1117
      20:11:24.485925 IP B > A: . ack 266761 win 1117
      20:11:24.486759 IP B > A: . ack 268281 win 1117
      20:11:24.487594 IP B > A: . ack 269801 win 1117
      20:11:24.488430 IP B > A: . ack 271321 win 1117
      20:11:24.489267 IP B > A: . ack 272841 win 1117
      20:11:24.490104 IP B > A: . ack 274361 win 1117
      20:11:24.490939 IP B > A: . ack 275881 win 1117
      20:11:24.491775 IP B > A: . ack 277401 win 1117
      20:11:24.491784 IP A > B: . 316921:332881(15960) ack 1 win 119
      20:11:24.492620 IP B > A: . ack 278921 win 1117
      20:11:24.493448 IP B > A: . ack 280441 win 1117
      20:11:24.494286 IP B > A: . ack 281961 win 1117
      20:11:24.495122 IP B > A: . ack 283481 win 1117
      20:11:24.495958 IP B > A: . ack 285001 win 1117
      20:11:24.496791 IP B > A: . ack 286521 win 1117
      20:11:24.497628 IP B > A: . ack 288041 win 1117
      20:11:24.498459 IP B > A: . ack 289561 win 1117
      20:11:24.499296 IP B > A: . ack 291081 win 1117
      20:11:24.500133 IP B > A: . ack 292601 win 1117
      20:11:24.500970 IP B > A: . ack 294121 win 1117
      20:11:24.501388 IP B > A: . ack 295641 win 1117
      20:11:24.501398 IP A > B: . 332881:351881(19000) ack 1 win 119
      
      While the expected behavior is more like :
      
      20:19:49.259620 IP A > B: . 197601:202161(4560) ack 1 win 119
      20:19:49.260446 IP B > A: . ack 154281 win 1212
      20:19:49.261282 IP B > A: . ack 155801 win 1212
      20:19:49.262125 IP B > A: . ack 157321 win 1212
      20:19:49.262136 IP A > B: . 202161:206721(4560) ack 1 win 119
      20:19:49.262958 IP B > A: . ack 158841 win 1212
      20:19:49.263795 IP B > A: . ack 160361 win 1212
      20:19:49.264628 IP B > A: . ack 161881 win 1212
      20:19:49.264637 IP A > B: . 206721:211281(4560) ack 1 win 119
      20:19:49.265465 IP B > A: . ack 163401 win 1212
      20:19:49.265886 IP B > A: . ack 164921 win 1212
      20:19:49.266722 IP B > A: . ack 166441 win 1212
      20:19:49.266732 IP A > B: . 211281:215841(4560) ack 1 win 119
      20:19:49.267559 IP B > A: . ack 167961 win 1212
      20:19:49.268394 IP B > A: . ack 169481 win 1212
      20:19:49.269232 IP B > A: . ack 171001 win 1212
      20:19:49.269241 IP A > B: . 215841:221161(5320) ack 1 win 119
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4541d60
    • T
      rtnetlink: Remove passing of attributes into rtnl_doit functions · 661d2967
      Thomas Graf 提交于
      With decnet converted, we can finally get rid of rta_buf and its
      computations around it. It also gets rid of the minimal header
      length verification since all message handlers do that explicitly
      anyway.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      661d2967
    • T
      decnet: Parse netlink attributes on our own · 58d7d8f9
      Thomas Graf 提交于
      decnet is the only subsystem left that is relying on the global
      netlink attribute buffer rta_buf. It's horrible design and we
      want to get rid of it.
      
      This converts all of decnet to do implicit attribute parsing. It
      also gets rid of the error prone struct dn_kern_rta.
      
      Yes, the fib_magic() stuff is not pretty.
      
      It's compiled tested but I need someone with appropriate hardware
      to test the patch since I don't have access to it.
      
      Cc: linux-decnet-user@lists.sourceforge.net
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      58d7d8f9
    • C
      udp: increase inner ip header ID during segmentation · d6a8c36d
      Cong Wang 提交于
      Similar to GRE tunnel, UDP tunnel should take care of IP header ID
      too.
      
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6a8c36d
    • C
      ip_gre: increase inner ip header ID during segmentation · 10c0d7ed
      Cong Wang 提交于
      According to the previous discussion [1] on netdev list, DaveM insists
      we should increase the IP header ID for each segmented packets.
      This patch fixes it.
      
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      
      1. http://marc.info/?t=136384172700001&r=1&w=2Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10c0d7ed
    • A
      netlink: Diag core and basic socket info dumping (v2) · eaaa3139
      Andrey Vagin 提交于
      The netlink_diag can be built as a module, just like it's done in
      unix sockets.
      
      The core dumping message carries the basic info about netlink sockets:
      family, type and protocol, portis, dst_group, dst_portid, state.
      
      Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.
      
      Netlink sockets cab be filtered by protocols.
      
      The socket inode number and cookie is reserved for future per-socket info
      retrieving. The per-protocol filtering is also reserved for future by
      requiring the sdiag_protocol to be zero.
      
      The file /proc/net/netlink doesn't provide enough information for
      dumping netlink sockets. It doesn't provide dst_group, dst_portid,
      groups above 32.
      
      v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eaaa3139
    • A
      net: prepare netlink code for netlink diag · 0f29c768
      Andrey Vagin 提交于
      Move a few declarations in a header.
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f29c768
  3. 21 3月, 2013 23 次提交
  4. 20 3月, 2013 2 次提交
    • P
      netfilter: remove unused "config IP_NF_QUEUE" · 3dd6664f
      Paul Bolle 提交于
      Kconfig symbol IP_NF_QUEUE is unused since commit
      d16cf20e ("netfilter: remove ip_queue
      support"). Let's remove it too.
      Signed-off-by: NPaul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3dd6664f
    • W
      packet: packet fanout rollover during socket overload · 77f65ebd
      Willem de Bruijn 提交于
      Changes:
        v3->v2: rebase (no other changes)
                passes selftest
        v2->v1: read f->num_members only once
                fix bug: test rollover mode + flag
      
      Minimize packet drop in a fanout group. If one socket is full,
      roll over packets to another from the group. Maintain flow
      affinity during normal load using an rxhash fanout policy, while
      dispersing unexpected traffic storms that hit a single cpu, such
      as spoofed-source DoS flows. Rollover breaks affinity for flows
      arriving at saturated sockets during those conditions.
      
      The patch adds a fanout policy ROLLOVER that rotates between sockets,
      filling each socket before moving to the next. It also adds a fanout
      flag ROLLOVER. If passed along with any other fanout policy, the
      primary policy is applied until the chosen socket is full. Then,
      rollover selects another socket, to delay packet drop until the
      entire system is saturated.
      
      Probing sockets is not free. Selecting the last used socket, as
      rollover does, is a greedy approach that maximizes chance of
      success, at the cost of extreme load imbalance. In practice, with
      sufficiently long queues to absorb bursts, sockets are drained in
      parallel and load balance looks uniform in `top`.
      
      To avoid contention, scales counters with number of sockets and
      accesses them lockfree. Values are bounds checked to ensure
      correctness.
      
      Tested using an application with 9 threads pinned to CPUs, one socket
      per thread and sufficient busywork per packet operation to limits each
      thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
      packets, a FANOUT_CPU setup processes 32 Kpps in total without this
      patch, 270 Kpps with the patch. Tested with read() and with a packet
      ring (V1).
      
      Also, passes psock_fanout.c unit test added to selftests.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77f65ebd
  5. 19 3月, 2013 6 次提交
    • B
      xfrm: use xfrm direction when lookup policy · b5fb82c4
      Baker Zhang 提交于
      because xfrm policy direction has same value with corresponding
      flow direction, so this problem is covered.
      
      In xfrm_lookup and __xfrm_policy_check, flow_cache_lookup is used to
      accelerate the lookup.
      
      Flow direction is given to flow_cache_lookup by policy_to_flow_dir.
      
      When the flow cache is mismatched, callback 'resolver' is called.
      
      'resolver' requires xfrm direction,
      so convert direction back to xfrm direction.
      Signed-off-by: NBaker Zhang <baker.zhang@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5fb82c4
    • H
      inet: limit length of fragment queue hash table bucket lists · 5a3da1fe
      Hannes Frederic Sowa 提交于
      This patch introduces a constant limit of the fragment queue hash
      table bucket list lengths. Currently the limit 128 is choosen somewhat
      arbitrary and just ensures that we can fill up the fragment cache with
      empty packets up to the default ip_frag_high_thresh limits. It should
      just protect from list iteration eating considerable amounts of cpu.
      
      If we reach the maximum length in one hash bucket a warning is printed.
      This is implemented on the caller side of inet_frag_find to distinguish
      between the different users of inet_fragment.c.
      
      I dropped the out of memory warning in the ipv4 fragment lookup path,
      because we already get a warning by the slab allocator.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a3da1fe
    • O
      can: dump stack on protocol bugs · c9bbb75f
      Oliver Hartkopp 提交于
      The rework of the kernel hlist implementation "hlist: drop the node parameter
      from iterators" (b67bfe0d) created some
      fallout in the form of non matching comments and obsolete code.
      
      Additionally to the cleanup this patch adds a WARN() statement to catch the
      caller of the wrong filter removal request.
      Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9bbb75f
    • J
      ipvs: remove extra rcu lock · bf93ad72
      Julian Anastasov 提交于
      In 3.7 we added code that uses ipv4_update_pmtu but after commit
      c5ae7d41 (ipv4: must use rcu protection while calling fib_lookup)
      the RCU lock is not needed.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      bf93ad72
    • J
      ipvs: add backup_only flag to avoid loops · 0c12582f
      Julian Anastasov 提交于
      Dmitry Akindinov is reporting for a problem where SYNs are looping
      between the master and backup server when the backup server is used as
      real server in DR mode and has IPVS rules to function as director.
      
      Even when the backup function is enabled we continue to forward
      traffic and schedule new connections when the current master is using
      the backup server as real server. While this is not a problem for NAT,
      for DR and TUN method the backup server can not determine if a request
      comes from client or from director.
      
      To avoid such loops add new sysctl flag backup_only. It can be needed
      for DR/TUN setups that do not need backup and director function at the
      same time. When the backup function is enabled we stop any forwarding
      and pass the traffic to the local stack (real server mode). The flag
      disables the director function when the backup function is enabled.
      
      For setups that enable backup function for some virtual services and
      director function for other virtual services there should be another
      more complex solution to support DR/TUN mode, may be to assign
      per-virtual service syncid value, so that we can differentiate the
      requests.
      Reported-by: NDmitry Akindinov <dimak@stalker.com>
      Tested-by: NGerman Myzovsky <lawyer@sipnet.ru>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      0c12582f
    • J
      ipvs: fix sctp chunk length order · cf2e3942
      Julian Anastasov 提交于
      Fix wrong but non-fatal access to chunk length.
      sch->length should be in network order, next chunk should
      be aligned to 4 bytes. Problem noticed in sparse output.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      cf2e3942