1. 25 Feb, 2015 · 1 commit
  2. 09 Feb, 2015 · 1 commit
    • ipvs: use 64-bit rates in stats · cd67cd5e
      Julian Anastasov committed
      IPVS stats are limited to 2^(32-10) conns/s and packets/s,
      2^(32-5) bytes/s. It is time to use 64 bits:
      
      * Change all conn/packet kernel counters to 64-bit and update
      them inside a u64_stats_update_{begin,end} section
      
      * In the kernel, use struct ip_vs_kstats instead of the user-space
      struct ip_vs_stats_user, and use the new function ip_vs_export_stats_user
      to export it to sockopt users, preserving compatibility with the
      32-bit values
      
      * Rename cpu counters "ustats" to "cnt"
      
      * Additionally provide 64-bit stats to netlink users:
      IPVS_SVC_ATTR_STATS64 and IPVS_DEST_ATTR_STATS64. The old stats
      remain for old binaries.
      
      * We can use ip_vs_copy_stats in ip_vs_stats_percpu_show
      
      Thanks to Chris Caputo for providing an initial patch for ip_vs_est.c
      Signed-off-by: Chris Caputo <ccaputo@alt.net>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  3. 03 Oct, 2014 · 1 commit
  4. 16 Sep, 2014 · 4 commits
  5. 18 Apr, 2014 · 1 commit
  6. 11 Nov, 2013 · 1 commit
    • netfilter: push reasm skb through instead of original frag skbs · 6aafeef0
      Jiri Pirko committed
      Pushing the original fragments through causes several problems. For
      example, frags may not be matched correctly. Take the following
      example:
      
      <example>
      On HOSTA do:
      ip6tables -I INPUT -p icmpv6 -j DROP
      ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT
      
      and on HOSTB you do:
      ping6 HOSTA -s2000    (MTU is 1500)
      
      Incoming echo requests will be filtered out on HOSTA. This issue does
      not occur with packets smaller than the MTU, where no fragmentation happens.
      </example>
      
      As was discussed previously, the only correct solution seems to be to use
      the reassembled skb instead of the separate frags. Doing this has positive
      side effects: it reduces sk_buff by one pointer (nfct_reasm), and the reasm
      dances in ipvs and conntrack can be removed.
      
      The future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c
      entirely and use the code in net/ipv6/reassembly.c instead.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 15 Oct, 2013 · 1 commit
    • ipvs: avoid rcu_barrier during netns cleanup · 9e4e948a
      Julian Anastasov committed
      commit 578bc3ef ("ipvs: reorganize dest trash") added
      rcu_barrier() on cleanup to wait for dest users and schedulers
      like LBLC and LBLCR to put their last dest reference.
      Using rcu_barrier() with many namespaces is problematic.
      
      Trying to fix it by freeing the dest with kfree_rcu is not
      a solution: RCU callbacks can run in parallel and their
      execution order is random.
      
      Fix it by creating a new function, ip_vs_dest_put_and_free(),
      which is heavier than ip_vs_dest_put(). We will use it only
      for schedulers like LBLC and LBLCR that can delay their dest
      release.
      
      By default, a dest's refcnt is above 0 while it is present in a
      service, and 0 when deleted but still in the trash list.
      Change the dest trash code to use ip_vs_dest_put_and_free(),
      so that refcnt -1 can be used for freeing. As a result,
      such checks remain in the slow path and the rcu_barrier() in
      netns cleanup can be removed.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  8. 22 Sep, 2013 · 1 commit
  9. 19 Sep, 2013 · 2 commits
    • ipvs: make the service replacement more robust · bcbde4c0
      Julian Anastasov committed
      commit 578bc3ef ("ipvs: reorganize dest trash") added
      the IP_VS_DEST_STATE_REMOVING flag and an RCU callback named
      ip_vs_dest_wait_readers() to keep dests and services after
      removal for at least an RCU grace period. But we have the
      following corner cases:
      
      - we cannot reuse the same dest if its service is removed
      while IP_VS_DEST_STATE_REMOVING is still set, because another dest
      removal in the first grace period cannot extend this period.
      It can happen when ipvsadm -C && ipvsadm -R is used.
      
      - dest->svc can be replaced, but ip_vs_in_stats() and
      ip_vs_out_stats() have no explicit read memory barriers
      when accessing dest->svc. It can happen that dest->svc
      was just freed (replaced) while we use it to update
      the stats.
      
      We solve the problems as follows:
      
      - IP_VS_DEST_STATE_REMOVING is removed and we ensure a fixed
      idle period for the dest (IP_VS_DEST_TRASH_PERIOD). idle_start
      will remember when we first noticed dest->refcnt=0 after
      deletion. Later, connections can grab a reference while in an
      RCU grace period, but if refcnt becomes 0 we can safely free
      the dest and its svc.
      
      - dest->svc becomes an RCU pointer. As a result, we add explicit
      RCU locking in ip_vs_in_stats() and ip_vs_out_stats().
      
      - __ip_vs_unbind_svc is renamed to __ip_vs_svc_put(); it
      can now free the service immediately or after an RCU grace
      period. dest->svc is not set to NULL anymore.
      
      As a result, unlinked dests and their services are
      always freed after the IP_VS_DEST_TRASH_PERIOD, and unused
      services are freed after an RCU grace period.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: fix overflow on dest weight multiply · c16526a7
      Simon Kirby committed
      Schedulers such as lblc and lblcr require the weight to be as high as the
      maximum number of active connections. In commit b552f7e3
      ("ipvs: unify the formula to estimate the overhead of processing
      connections"), the consideration of inactconns and activeconns was cleaned
      up to always count activeconns as 256 times more important than inactconns.
      In cases where 3000 or more connections are expected, a weight of 3000
      yields 3000 * 256 * 3000, which overflows the 32-bit signed result used
      to determine whether rescheduling is required.
      
      On amd64, this merely changes the multiply and comparison instructions to
      64-bit. On x86, a 64-bit result is already present from imull, so only
      a few more comparison instructions are emitted.
      Signed-off-by: Simon Kirby <sim@hostway.ca>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  10. 26 Jun, 2013 · 4 commits
  11. 26 May, 2013 · 1 commit
  12. 23 Apr, 2013 · 1 commit
  13. 02 Apr, 2013 · 15 commits
    • ipvs: convert services to rcu · ceec4c38
      Julian Anastasov committed
      This is the final step in RCU conversion.
      
      Things that are removed:
      
      - svc->usecnt: now svc is accessed under RCU read lock
      - svc->inc: and some unused code
      - ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE
      - __ip_vs_svc_lock: replaced with RCU
      - IP_VS_WAIT_WHILE: now readers look up svcs and dests under
      RCU and work in parallel with configuration
      
      Other changes:
      
      - previously, the RCU read-side critical section included only
      the call to the schedule method; now it is extended to include
      the service lookup
      - ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist
      - svc->pe and svc->scheduler remain until the end of the grace
      period; the schedulers are prepared for such RCU readers
      even after done_service is called, but they need to use
      synchronize_rcu because the last ip_vs_scheduler_put can
      happen while RCU read-side critical sections still use an
      outdated svc->scheduler pointer
      - as planned, update_service is removed
      - empty services can be freed immediately after the grace period.
      If dests were present, the services are freed from
      the dest trash code.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert dests to rcu · 413c2d04
      Julian Anastasov committed
      In previous commits the schedulers started to access
      svc->destinations with _rcu list-traversal primitives
      because the IP_VS_WAIT_WHILE macro still plays the role of
      a grace period. Now it is time to finish the updating part,
      i.e. to add and delete dests with the _rcu suffix, before
      removing IP_VS_WAIT_WHILE in the next commit.
      
      We use the same rule for conns as for the
      schedulers: dests can be searched in an RCU read-side critical
      section, where ip_vs_dest_hold can be called by ip_vs_bind_dest.
      
      Some things are not perfect; for example, updating code calls
      functions like ip_vs_lookup_dest under RCU just because the
      same function is used by both readers and updaters.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert sched_lock to spin lock · ba3a3ce1
      Julian Anastasov committed
      As all read_locks are gone, a spin lock is preferred.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: do not expect result from done_service · ed3ffc4e
      Julian Anastasov committed
      This method releases the scheduler state and cannot fail.
      This change will help to properly replace the scheduler
      in a following patch.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: reorganize dest trash · 578bc3ef
      Julian Anastasov committed
      All dests will go to the trash, no exceptions.
      But we have to use a new list node, t_list, for this, due
      to RCU changes in the following patches. Dests will wait there
      for an initial grace period, and later for all conns and
      schedulers to put their references. Unlike before, dests no
      longer hold a reference for staying in the dest trash.
      
      As a result, we do not load ip_vs_dest_put with
      extra checks for the last refcnt, and the schedulers do not
      need to play games with atomic_inc_not_zero while
      selecting the best destination.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: add ip_vs_dest_hold and ip_vs_dest_put · fca9c20a
      Julian Anastasov committed
      ip_vs_dest_hold will be used under the RCU lock,
      while ip_vs_dest_put can be called even after the dest
      is removed from its service, as happens for conns and
      some schedulers.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: preparations for using rcu in schedulers · 6b6df466
      Julian Anastasov committed
      Allow schedulers to use rcu_dereference when
      returning a destination on lookup. The RCU read-side critical
      section will allow ip_vs_bind_dest to get a dest refcnt, as
      preparation for the step where destinations will be
      deleted without an IP_VS_WAIT_WHILE guard that holds up
      packet processing during an update.
      
      Add new optional scheduler methods add_dest,
      del_dest and upd_dest. For now the methods are called
      together with update_service, but update_service will be
      removed in a following change.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new · 9a05475c
      Julian Anastasov committed
      We have many fields to set and few to reset, so use
      kmem_cache_alloc instead to save some cycles.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: reorder keys in connection structure · 1845ed0b
      Julian Anastasov committed
      __ip_vs_conn_in_get and ip_vs_conn_out_get are
      hot paths. Optimize them so that ports are matched first.
      By moving net and fwmark lower, on a 32-bit arch we can fit
      caddr in a 32-byte cache line and all addresses in a 64-byte
      cache line.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert connection locking · 088339a5
      Julian Anastasov committed
      Convert __ip_vs_conntbl_lock_array as follows:
      
      - readers that do not modify conn lists will use RCU lock
      - updaters that modify lists will use spinlock_t
      
      Now conn lookups use an RCU read-side critical
      section. Without using __ip_vs_conn_get, such
      places have access to connection fields and can
      dereference some pointers like pe and pe_data, plus
      the ability to update the timer expiration. If full access
      is required, we contend for a reference.
      
      We add a barrier in __ip_vs_conn_put, so that
      other CPUs see the refcnt operation after the other writes.
      
      With the introduction of ip_vs_conn_unlink()
      we try to reorganize ip_vs_conn_expire() so that
      unhashing of connections that should stay longer is
      avoided, even if only for a very short time.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: remove rs_lock by using RCU · 276472ea
      Julian Anastasov committed
      rs_lock was used to protect rs_table (a hash table)
      from updaters (under the global mutex) and readers (packet handlers).
      We can remove rs_lock by using the RCU lock for readers. Reclaiming
      a dest only with kfree_rcu is enough because the readers access
      only fields from the ip_vs_dest structure.
      
      Use hlist for rs_table.
      
      As we are now using hlist_del_rcu, introduce an in_rs_table
      flag as a replacement for the list_empty checks, which do not
      work with RCU. It is needed because only NAT dests are in
      rs_table.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert app locks · 363c97d7
      Julian Anastasov committed
      We use locks like tcp_app_lock, udp_app_lock and
      sctp_app_lock to protect access to the protocol hash tables
      from readers in packet context, while the application
      instances (inc) are [un]registered under the global mutex.
      
      As the hash tables are mostly read when conns are
      created and bound to an app, use RCU for readers and reclaim
      the app instance after a grace period.
      
      Simplify ip_vs_app_inc_get because we use usecnt
      only for statistics and rely on module refcounting.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: optimize dst usage for real server · 026ace06
      Julian Anastasov committed
      Currently, when forwarding requests to real servers,
      we use dst_lock and atomic operations when cloning the
      dst_cache value. As the dst_cache value does not change
      most of the time, it is better to use RCU and to take
      dst_lock only when we need to replace an obsolete dst.
      For this to work we keep dst_cache in a new structure
      protected by RCU. For packets to remote real servers we will
      use the noref version of dst_cache; it will be valid while
      we are in an RCU read-side critical section because
      dst_release for replaced dsts is now invoked after the grace
      period. Packets to local real servers that are passed to the
      local stack with NF_ACCEPT need a dst clone.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: rename functions related to dst_cache reset · d1deae4d
      Julian Anastasov committed
      Move and give better names to two functions:
      
      - ip_vs_dst_reset to __ip_vs_dst_cache_reset
      - __ip_vs_dev_reset to ip_vs_forget_dev
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: avoid routing by TOS for real server · c90558da
      Julian Anastasov committed
      Avoid replacing the cached route for a real server
      on every packet with a different TOS. I doubt that routing
      by TOS to real servers is used at all, so we should be
      better off with this optimization.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  14. 19 Mar, 2013 · 2 commits
    • ipvs: add backup_only flag to avoid loops · 0c12582f
      Julian Anastasov committed
      Dmitry Akindinov reported a problem where SYNs loop
      between the master and backup server when the backup server is used
      as a real server in DR mode and has IPVS rules to function as a director.
      
      Even when the backup function is enabled, we continue to forward
      traffic and schedule new connections when the current master is using
      the backup server as a real server. While this is not a problem for NAT,
      with the DR and TUN methods the backup server cannot determine whether
      a request comes from a client or from the director.
      
      To avoid such loops, add a new sysctl flag, backup_only. It can be
      needed for DR/TUN setups that do not need the backup and director
      functions at the same time. When the flag is set and the backup
      function is enabled, we stop any forwarding and pass the traffic to
      the local stack (real server mode). That is, the flag disables the
      director function while the backup function is enabled.
      
      For setups that enable the backup function for some virtual services
      and the director function for others, there should be another, more
      complex solution to support DR/TUN mode, maybe assigning a
      per-virtual-service syncid value so that we can differentiate the
      requests.
      Reported-by: Dmitry Akindinov <dimak@stalker.com>
      Tested-by: German Myzovsky <lawyer@sipnet.ru>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: fix some sparse warnings · b962abdc
      Julian Anastasov committed
      Add missing __percpu annotations and make ip_vs_net_id static.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  15. 23 Oct, 2012 · 1 commit
  16. 28 Sep, 2012 · 3 commits
    • ipvs: API change to avoid rescan of IPv6 exthdr · d4383f04
      Jesper Dangaard Brouer committed
      Reduce the number of times we scan/skip the IPv6 exthdrs.
      
      This patch contains a lot of API changes.  This is done to avoid
      repeating the scan for the IPv6 headers via ipv6_find_hdr(),
      which is called by ip_vs_fill_iph_skb().
      
      Finding the IPv6 headers is done as early as possible, and passed on
      as a pointer "struct ip_vs_iphdr *" to the affected functions.
      
      This patch reduces or removes 19 calls to ip_vs_fill_iph_skb().
      
      Notice, I have chosen not to change the API of the function
      pointer (*schedule) (in struct ip_vs_scheduler), as it can be
      used by external schedulers via {un,}register_ip_vs_scheduler.
      Only 4 out of 10 schedulers use info from ip_vs_iphdr*, and
      when they do, they are only interested in iph->{s,d}addr.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: Complete IPv6 fragment handling for IPVS · 2f74713d
      Jesper Dangaard Brouer committed
      IPVS now supports fragmented packets, with support from nf_conntrack_reasm.c.
      
      Based on a patch from Hans Schillstrom.
      
      IPVS does the same as conntrack, i.e. it uses skb->nfct_reasm
      (when all fragments are collected, nf_ct_frag6_output()
      starts a "replay" of all fragments into the interrupted
      PREROUTING chain at prio -399 (NF_IP6_PRI_CONNTRACK_DEFRAG+1),
      with nfct_reasm pointing to the assembled packet).
      
      Notice, the module nf_defrag_ipv6 must be loaded for this to work.
      Report unhandled fragments, and recommend the user load nf_defrag_ipv6.
      
      To handle fw-marks for fragments, add a new IPVS hook into the
      prerouting chain at prio -99 (NF_IP6_PRI_NAT_DST+1) to catch fragments
      and copy the fw-mark info from the first packet with an upper-layer header.
      
      IPv6 fragment handling should be the last item on the IPVS IPv6
      missing-support list.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: Fix faulty IPv6 extension header handling in IPVS · 63dca2c0
      Jesper Dangaard Brouer committed
      IPv6 packets can contain extension headers, so it is wrong to
      assume that the transport/upper-layer header starts right after
      the IPv6 header (struct ipv6hdr).  IPVS used this false
      assumption and would write SNAT & DNAT modifications at a fixed
      position, corrupting the message.
      
      To fix this, the proper header position must be found before
      modifying packets.  Introduce ip_vs_fill_iph_skb(), which uses
      ipv6_find_hdr() to skip the exthdrs. It finds (1) the transport
      header offset, (2) the protocol, and (3) detects whether the
      packet is a fragment.
      
      Note that a fragment in IPv6 is represented via an exthdr; thus,
      this is detected while skipping through the exthdrs.
      
      This patch depends on commit 84018f55:
       "netfilter: ip6_tables: add flags parameter to ipv6_find_hdr()"
      This also adds a dependency to ip6_tables.
      
      Originally based on patch from: Hans Schillstrom
      
      kABI notes:
      Changing struct ip_vs_iphdr is a potential minor kABI breaker,
      because external modules can be compiled with another version of
      this struct.  This should not matter, as they would most likely
      be using a compiled-in version of ip_vs_fill_iphdr().  When
      recompiled, they will notice that ip_vs_fill_iphdr() no longer
      exists and that they have to use ip_vs_fill_iph_skb() instead.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>