1. 02 4月, 2013 12 次提交
    • J
      ipvs: do not expect result from done_service · ed3ffc4e
      Julian Anastasov 提交于
      This method releases the scheduler state,
      it can not fail. Such change will help to properly
      replace the scheduler in following patch.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      ed3ffc4e
    • J
      ipvs: reorganize dest trash · 578bc3ef
      Julian Anastasov 提交于
      All dests will go to trash, no exceptions.
      But we have to use new list node t_list for this, due
      to RCU changes in following patches. Dests will wait there
      initial grace period and later all conns and schedulers to
      put their reference. The dests don't get reference for
      staying in dest trash as before.
      
      	As result, we do not load ip_vs_dest_put with
      extra checks for last refcnt and the schedulers do not
      need to play games with atomic_inc_not_zero while
      selecting best destination.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      578bc3ef
    • J
      ipvs: add ip_vs_dest_hold and ip_vs_dest_put · fca9c20a
      Julian Anastasov 提交于
      ip_vs_dest_hold will be used under RCU lock
      while ip_vs_dest_put can be called even after dest
      is removed from service, as it happens for conns and
      some schedulers.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      fca9c20a
    • J
      ipvs: preparations for using rcu in schedulers · 6b6df466
      Julian Anastasov 提交于
      Allow schedulers to use rcu_dereference when
      returning destination on lookup. The RCU read-side critical
      section will allow ip_vs_bind_dest to get dest refcnt as
      preparation for the step where destinations will be
      deleted without an IP_VS_WAIT_WHILE guard that holds the
      packet processing during update.
      
      	Add new optional scheduler methods add_dest,
      del_dest and upd_dest. For now the methods are called
      together with update_service but update_service will be
      removed in a following change.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      6b6df466
    • J
      ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new · 9a05475c
      Julian Anastasov 提交于
      We have many fields to set and few to reset,
      use kmem_cache_alloc instead to save some cycles.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      9a05475c
    • J
      ipvs: reorder keys in connection structure · 1845ed0b
      Julian Anastasov 提交于
      __ip_vs_conn_in_get and ip_vs_conn_out_get are
      hot places. Optimize them, so that ports are matched first.
      By moving net and fwmark below, on 32-bit arch we can fit
      caddr in 32-byte cache line and all addresses in 64-byte
      cache line.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      1845ed0b
    • J
      ipvs: convert connection locking · 088339a5
      Julian Anastasov 提交于
      Convert __ip_vs_conntbl_lock_array as follows:
      
      - readers that do not modify conn lists will use RCU lock
      - updaters that modify lists will use spinlock_t
      
      Now for conn lookups we will use RCU read-side
      critical section. Without using __ip_vs_conn_get such
      places have access to connection fields and can
      dereference some pointers like pe and pe_data plus
      the ability to update timer expiration. If full access
      is required we contend for reference.
      
      We add barrier in __ip_vs_conn_put, so that
      other CPUs see the refcnt operation after other writes.
      
      With the introduction of ip_vs_conn_unlink()
      we try to reorganize ip_vs_conn_expire(), so that
      unhashing of connections that should stay more time is
      avoided, even if it is for very short time.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      088339a5
    • J
      ipvs: remove rs_lock by using RCU · 276472ea
      Julian Anastasov 提交于
      rs_lock was used to protect rs_table (hash table)
      from updaters (under global mutex) and readers (packet handlers).
      We can remove rs_lock by using RCU lock for readers. Reclaiming
      dest only with kfree_rcu is enough because the readers access
      only fields from the ip_vs_dest structure.
      
      Use hlist for rs_table.
      
      As we are now using hlist_del_rcu, introduce in_rs_table
      flag as replacement for the list_empty checks which do not
      work with RCU. It is needed because only NAT dests are in
      the rs_table.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      276472ea
    • J
      ipvs: convert app locks · 363c97d7
      Julian Anastasov 提交于
      We use locks like tcp_app_lock, udp_app_lock,
      sctp_app_lock to protect access to the protocol hash tables
      from readers in packet context while the application
      instances (inc) are [un]registered under global mutex.
      
      As the hash tables are mostly read when conns are
      created and bound to app, use RCU for readers and reclaim
      app instance after grace period.
      
      Simplify ip_vs_app_inc_get because we use usecnt
      only for statistics and rely on module refcounting.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      363c97d7
    • J
      ipvs: optimize dst usage for real server · 026ace06
      Julian Anastasov 提交于
      Currently when forwarding requests to real servers
      we use dst_lock and atomic operations when cloning the
      dst_cache value. As the dst_cache value does not change
      most of the time it is better to use RCU and to lock
      dst_lock only when we need to replace the obsoleted dst.
      For this to work we keep dst_cache in new structure protected
      by RCU. For packets to remote real servers we will use noref
      version of dst_cache, it will be valid while we are in RCU
      read-side critical section because now dst_release for replaced
      dsts will be invoked after the grace period. Packets to
      local real servers that are passed to local stack with
      NF_ACCEPT need a dst clone.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      026ace06
    • J
      ipvs: rename functions related to dst_cache reset · d1deae4d
      Julian Anastasov 提交于
      Move and give better names to two functions:
      
      - ip_vs_dst_reset to __ip_vs_dst_cache_reset
      - __ip_vs_dev_reset to ip_vs_forget_dev
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      d1deae4d
    • J
      ipvs: avoid routing by TOS for real server · c90558da
      Julian Anastasov 提交于
      Avoid replacing the cached route for real server
      on every packet with different TOS. I doubt that routing
      by TOS for real server is used at all, so we should be
      better with such optimization.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      c90558da
  2. 28 3月, 2013 1 次提交
  3. 27 3月, 2013 3 次提交
    • P
      ipv4: Fix ip-header identification for gso packets. · 330305cc
      Pravin B Shelar 提交于
      ip-header id needs to be incremented even if IP_DF flag is set.
      This behaviour was changed in commit 490ab081
      (IP_GRE: Fix IP-Identification).
      
      Following patch fixes it so that identification is always
      incremented.
      Reported-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      330305cc
    • Y
      firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection. · 6752c8db
      YOSHIFUJI Hideaki / 吉藤英明 提交于
      Inspection of upper layer protocol is considered harmful, especially
      if it is about ARP or other stateful upper layer protocol; driver
      cannot (and should not) have full state of them.
      
      IPv4 over Firewire module used to inspect ARP (both in sending path
      and in receiving path), and record peer's GUID, max packet size, max
      speed and fifo address.  This patch removes such inspection by extending
      our "hardware address" definition to include other information as well:
      max packet size, max speed and fifo.  By doing this, The neighbour
      module in networking subsystem can cache them.
      
      Note: As we have started ignoring sspd and max_rec in ARP/NDP, those
            information will not be used in the driver when sending.
      
      When a packet is being sent, the IP layer fills our pseudo header with
      the extended "hardware address", including GUID and fifo.  The driver
      can look-up node-id (the real but rather volatile low-level address)
      by GUID, and then the module can send the packet to the wire using
      parameters provided in the extendedn hardware address.
      
      This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
      over IEEE1394 (RFC3146) share same "hardware address" format
      in their address resolution protocols.
      
      Here, extended "hardware address" is defined as follows:
      
      union fwnet_hwaddr {
      	u8 u[16];
      	struct {
      		__be64 uniq_id;		/* EUI-64			*/
      		u8 max_rec;		/* max packet size		*/
      		u8 sspd;		/* max speed			*/
      		__be16 fifo_hi;		/* hi 16bits of FIFO addr	*/
      		__be32 fifo_lo;		/* lo 32bits of FIFO addr	*/
      	} __packed uc;
      };
      
      Note that Hardware address is declared as union, so that we can map full
      IP address into this, when implementing MCAP (Multicast Cannel Allocation
      Protocol) for IPv6, but IP and ARP subsystem do not need to know this
      format in detail.
      
      One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
      is that 1394 ARP Request/Reply do not contain the target hardware address
      field (aka ar$tha).  This difference is handled in the ARP subsystem.
      
      CC: Stephan Gatzka <stephan.gatzka@gmail.com>
      Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6752c8db
    • P
      GRE: Refactor GRE tunneling code. · c5441932
      Pravin B Shelar 提交于
      Following patch refactors GRE code into ip tunneling code and GRE
      specific code. Common tunneling code is moved to ip_tunnel module.
      ip_tunnel module is written as generic library which can be used
      by different tunneling implementations.
      
      ip_tunnel module contains following components:
       - packet xmit and rcv generic code. xmit flow looks like
         (gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
       - hash table of all devices.
       - lookup for tunnel devices.
       - control plane operations like device create, destroy, ioctl, netlink
         operations code.
       - registration for tunneling modules, like gre, ipip etc.
       - define single pcpu_tstats dev->tstats.
       - struct tnl_ptk_info added to pass parsed tunnel packet parameters.
      
      ipip.h header is renamed to ip_tunnel.h
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5441932
  4. 26 3月, 2013 1 次提交
  5. 25 3月, 2013 3 次提交
  6. 22 3月, 2013 2 次提交
  7. 21 3月, 2013 2 次提交
    • Y
      tcp: refactor F-RTO · 9b44190d
      Yuchung Cheng 提交于
      The patch series refactor the F-RTO feature (RFC4138/5682).
      
      This is to simplify the loss recovery processing. Existing F-RTO
      was developed during the experimental stage (RFC4138) and has
      many experimental features.  It takes a separate code path from
      the traditional timeout processing by overloading CA_Disorder
      instead of using CA_Loss state. This complicates CA_Disorder state
      handling because it's also used for handling dubious ACKs and undos.
      While the algorithm in the RFC does not change the congestion control,
      the implementation intercepts congestion control in various places
      (e.g., frto_cwnd in tcp_ack()).
      
      The new code implements newer F-RTO RFC5682 using CA_Loss processing
      path.  F-RTO becomes a small extension in the timeout processing
      and interfaces with congestion control and Eifel undo modules.
      It lets congestion control (module) determines how many to send
      independently.  F-RTO only chooses what to send in order to detect
      spurious retranmission. If timeout is found spurious it invokes
      existing Eifel undo algorithms like DSACK or TCP timestamp based
      detection.
      
      The first patch removes all F-RTO code except the sysctl_tcp_frto is
      left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
      westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b44190d
    • D
      flow_keys: include thoff into flow_keys for later usage · 8ed78166
      Daniel Borkmann 提交于
      In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
      doing the work here anyway, also store thoff for a later usage, e.g. in
      the BPF filter.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ed78166
  8. 20 3月, 2013 1 次提交
  9. 19 3月, 2013 3 次提交
    • H
      inet: limit length of fragment queue hash table bucket lists · 5a3da1fe
      Hannes Frederic Sowa 提交于
      This patch introduces a constant limit of the fragment queue hash
      table bucket list lengths. Currently the limit 128 is choosen somewhat
      arbitrary and just ensures that we can fill up the fragment cache with
      empty packets up to the default ip_frag_high_thresh limits. It should
      just protect from list iteration eating considerable amounts of cpu.
      
      If we reach the maximum length in one hash bucket a warning is printed.
      This is implemented on the caller side of inet_frag_find to distinguish
      between the different users of inet_fragment.c.
      
      I dropped the out of memory warning in the ipv4 fragment lookup path,
      because we already get a warning by the slab allocator.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a3da1fe
    • J
      ipvs: add backup_only flag to avoid loops · 0c12582f
      Julian Anastasov 提交于
      Dmitry Akindinov is reporting for a problem where SYNs are looping
      between the master and backup server when the backup server is used as
      real server in DR mode and has IPVS rules to function as director.
      
      Even when the backup function is enabled we continue to forward
      traffic and schedule new connections when the current master is using
      the backup server as real server. While this is not a problem for NAT,
      for DR and TUN method the backup server can not determine if a request
      comes from client or from director.
      
      To avoid such loops add new sysctl flag backup_only. It can be needed
      for DR/TUN setups that do not need backup and director function at the
      same time. When the backup function is enabled we stop any forwarding
      and pass the traffic to the local stack (real server mode). The flag
      disables the director function when the backup function is enabled.
      
      For setups that enable backup function for some virtual services and
      director function for other virtual services there should be another
      more complex solution to support DR/TUN mode, may be to assign
      per-virtual service syncid value, so that we can differentiate the
      requests.
      Reported-by: NDmitry Akindinov <dimak@stalker.com>
      Tested-by: NGerman Myzovsky <lawyer@sipnet.ru>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      0c12582f
    • J
      ipvs: fix some sparse warnings · b962abdc
      Julian Anastasov 提交于
      Add missing __percpu annotations and make ip_vs_net_id static.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      b962abdc
  10. 18 3月, 2013 2 次提交
  11. 15 3月, 2013 1 次提交
  12. 13 3月, 2013 1 次提交
    • D
      ipv4: fix definition of FIB_TABLE_HASHSZ · 5b9e12db
      Denis V. Lunev 提交于
      a long time ago by the commit
      
        commit 93456b6d
        Author: Denis V. Lunev <den@openvz.org>
        Date:   Thu Jan 10 03:23:38 2008 -0800
      
          [IPV4]: Unify access to the routing tables.
      
      the defenition of FIB_HASH_TABLE size has obtained wrong dependency:
      it should depend upon CONFIG_IP_MULTIPLE_TABLES (as was in the original
      code) but it was depended from CONFIG_IP_ROUTE_MULTIPATH
      
      This patch returns the situation to the original state.
      
      The problem was spotted by Tingwei Liu.
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Tingwei Liu <tingw.liu@gmail.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b9e12db
  13. 12 3月, 2013 1 次提交
    • N
      tcp: Tail loss probe (TLP) · 6ba8a3b1
      Nandita Dukkipati 提交于
      This patch series implement the Tail loss probe (TLP) algorithm described
      in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
      first patch implements the basic algorithm.
      
      TLP's goal is to reduce tail latency of short transactions. It achieves
      this by converting retransmission timeouts (RTOs) occuring due
      to tail losses (losses at end of transactions) into fast recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating
      that an ACK is overdue and triggers a loss probe packet. The PTO value
      is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
      ACK timer when there is only one oustanding packet.
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
      
      In addition, the patch includes a small variation to the Early Retransmit
      (ER) algorithm, such that ER and TLP together can in principle recover any
      N-degree of tail loss through fast recovery. TLP is controlled by the same
      sysctl as ER, tcp_early_retrans sysctl.
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series have been extensively tested on Google Web servers.
      It is most effective for short Web trasactions, where it reduced RTOs by 15%
      and improved HTTP response time (average by 6%, 99th percentile by 10%).
      The transmitted probes account for <0.5% of the overall transmissions.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ba8a3b1
  14. 11 3月, 2013 1 次提交
  15. 10 3月, 2013 1 次提交
  16. 09 3月, 2013 1 次提交
  17. 08 3月, 2013 2 次提交
  18. 06 3月, 2013 2 次提交