1. 26 May, 2013 1 commit
  2. 23 Apr, 2013 1 commit
  3. 02 Apr, 2013 15 commits
    • ipvs: convert services to rcu · ceec4c38
      Committed by Julian Anastasov
      This is the final step in RCU conversion.
      
      Things that are removed:
      
      - svc->usecnt: now svc is accessed under RCU read lock
      - svc->inc: and some unused code
      - ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE
      - __ip_vs_svc_lock: replaced with RCU
      - IP_VS_WAIT_WHILE: now readers lookup svcs and dests under
      	RCU and work in parallel with configuration
      
      Other changes:
      
      - previously, the RCU read-side critical section covered only
      the call to the schedule method; now it is extended to include
      service lookup
      - ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist
      - svc->pe and svc->scheduler remain valid until the end of the
        grace period. The schedulers are prepared for such RCU readers
        even after done_service is called, but they need
        to use synchronize_rcu because the last ip_vs_scheduler_put
        can happen while RCU read-side critical sections
        still use an outdated svc->scheduler pointer
      - as planned, update_service is removed
      - empty services can be freed immediately after grace period.
      	If dests were present, the services are freed from
      	the dest trash code
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert dests to rcu · 413c2d04
      Committed by Julian Anastasov
      In previous commits the schedulers started to access
      svc->destinations with _rcu list traversal primitives
      because the IP_VS_WAIT_WHILE macro still plays the role of
      grace period. Now it is time to finish the updating part,
      i.e. adding and deleting of dests with _rcu suffix before
      removing the IP_VS_WAIT_WHILE in next commit.
      
      We use the same rule for conns as for the
      schedulers: dests can be searched in RCU read-side critical
      section where ip_vs_dest_hold can be called by ip_vs_bind_dest.
      
      Some things are not perfect, for example, calling
      functions like ip_vs_lookup_dest from updating code under
      RCU, just because we use some function both from reader
      and from updater.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert sched_lock to spin lock · ba3a3ce1
      Committed by Julian Anastasov
      As all read_locks are gone spin lock is preferred.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: do not expect result from done_service · ed3ffc4e
      Committed by Julian Anastasov
      This method releases the scheduler state,
      it can not fail. Such change will help to properly
      replace the scheduler in following patch.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: reorganize dest trash · 578bc3ef
      Committed by Julian Anastasov
      All dests will go to trash, no exceptions.
      But we have to use a new list node, t_list, for this, due
      to RCU changes in following patches. Dests wait there for an
      initial grace period and then for all conns and schedulers to
      put their references. Unlike before, dests do not hold a
      reference for staying in the dest trash.
      
      	As result, we do not load ip_vs_dest_put with
      extra checks for last refcnt and the schedulers do not
      need to play games with atomic_inc_not_zero while
      selecting best destination.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: add ip_vs_dest_hold and ip_vs_dest_put · fca9c20a
      Committed by Julian Anastasov
      ip_vs_dest_hold will be used under RCU lock
      while ip_vs_dest_put can be called even after dest
      is removed from service, as it happens for conns and
      some schedulers.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: preparations for using rcu in schedulers · 6b6df466
      Committed by Julian Anastasov
      Allow schedulers to use rcu_dereference when
      returning destination on lookup. The RCU read-side critical
      section will allow ip_vs_bind_dest to get dest refcnt as
      preparation for the step where destinations will be
      deleted without an IP_VS_WAIT_WHILE guard that holds the
      packet processing during update.
      
      	Add new optional scheduler methods add_dest,
      del_dest and upd_dest. For now the methods are called
      together with update_service but update_service will be
      removed in a following change.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new · 9a05475c
      Committed by Julian Anastasov
      We have many fields to set and few to reset,
      use kmem_cache_alloc instead to save some cycles.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: reorder keys in connection structure · 1845ed0b
      Committed by Julian Anastasov
      __ip_vs_conn_in_get and ip_vs_conn_out_get are
      hot places. Optimize them, so that ports are matched first.
      By moving net and fwmark below, on 32-bit arch we can fit
      caddr in 32-byte cache line and all addresses in 64-byte
      cache line.
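The "ports first" ordering can be sketched in plain C. This is a minimal illustration, assuming a simplified IPv4-only key; `struct conn_key` and `conn_key_match` are hypothetical names and do not reproduce the real `struct ip_vs_conn` layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, IPv4-only stand-in for the lookup keys.
 * The 16-bit fields sit at the start so they share the first bytes
 * of the structure with the client address. */
struct conn_key {
    uint16_t cport;   /* client port */
    uint16_t dport;   /* destination (real server) port */
    uint16_t vport;   /* virtual service port */
    uint16_t af;      /* address family */
    uint32_t caddr;   /* client address */
    uint32_t vaddr;   /* virtual address */
    uint32_t daddr;   /* destination address */
};

/* Compare the cheap 16-bit fields first; most non-matching entries in a
 * hash chain are rejected before any address is touched. */
static bool conn_key_match(const struct conn_key *a, const struct conn_key *b)
{
    return a->cport == b->cport &&
           a->dport == b->dport &&
           a->vport == b->vport &&
           a->af == b->af &&
           a->caddr == b->caddr &&
           a->vaddr == b->vaddr &&
           a->daddr == b->daddr;
}
```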
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert connection locking · 088339a5
      Committed by Julian Anastasov
      Convert __ip_vs_conntbl_lock_array as follows:
      
      - readers that do not modify conn lists will use RCU lock
      - updaters that modify lists will use spinlock_t
      
      Now for conn lookups we will use RCU read-side
      critical section. Without using __ip_vs_conn_get such
      places have access to connection fields and can
      dereference some pointers like pe and pe_data plus
      the ability to update timer expiration. If full access
      is required we contend for reference.
      
      We add barrier in __ip_vs_conn_put, so that
      other CPUs see the refcnt operation after other writes.
      
      With the introduction of ip_vs_conn_unlink()
      we try to reorganize ip_vs_conn_expire(), so that
      unhashing of connections that should stay more time is
      avoided, even if it is for very short time.
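The "contend for reference" and "barrier in put" ideas can be approximated in user space with C11 atomics. This is only a sketch under that assumption: `struct conn`, `conn_try_get` and `conn_put` are hypothetical names, and the kernel uses its own atomic and barrier primitives rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection object with a reference counter. */
struct conn {
    atomic_int refcnt;
    int state;
};

/* Take a reference only if the counter is still non-zero, so a reader
 * that found the object during a lookup cannot revive one that is
 * already being released. */
static bool conn_try_get(struct conn *cp)
{
    int old = atomic_load_explicit(&cp->refcnt, memory_order_relaxed);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak_explicit(&cp->refcnt, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}

/* Drop a reference. Release ordering makes earlier writes to *cp
 * visible to other CPUs before they observe the decrement. */
static void conn_put(struct conn *cp)
{
    atomic_fetch_sub_explicit(&cp->refcnt, 1, memory_order_release);
}
```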
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: remove rs_lock by using RCU · 276472ea
      Committed by Julian Anastasov
      rs_lock was used to protect rs_table (hash table)
      from updaters (under global mutex) and readers (packet handlers).
      We can remove rs_lock by using RCU lock for readers. Reclaiming
      dest only with kfree_rcu is enough because the readers access
      only fields from the ip_vs_dest structure.
      
      Use hlist for rs_table.
      
      As we are now using hlist_del_rcu, introduce in_rs_table
      flag as replacement for the list_empty checks which do not
      work with RCU. It is needed because only NAT dests are in
      the rs_table.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: convert app locks · 363c97d7
      Committed by Julian Anastasov
      We use locks like tcp_app_lock, udp_app_lock,
      sctp_app_lock to protect access to the protocol hash tables
      from readers in packet context while the application
      instances (inc) are [un]registered under global mutex.
      
      As the hash tables are mostly read when conns are
      created and bound to app, use RCU for readers and reclaim
      app instance after grace period.
      
      Simplify ip_vs_app_inc_get because we use usecnt
      only for statistics and rely on module refcounting.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: optimize dst usage for real server · 026ace06
      Committed by Julian Anastasov
      Currently when forwarding requests to real servers
      we use dst_lock and atomic operations when cloning the
      dst_cache value. As the dst_cache value does not change
      most of the time it is better to use RCU and to lock
      dst_lock only when we need to replace the obsoleted dst.
      For this to work we keep dst_cache in new structure protected
      by RCU. For packets to remote real servers we will use noref
      version of dst_cache, it will be valid while we are in RCU
      read-side critical section because now dst_release for replaced
      dsts will be invoked after the grace period. Packets to
      local real servers that are passed to local stack with
      NF_ACCEPT need a dst clone.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: rename functions related to dst_cache reset · d1deae4d
      Committed by Julian Anastasov
      Move and give better names to two functions:
      
      - ip_vs_dst_reset to __ip_vs_dst_cache_reset
      - __ip_vs_dev_reset to ip_vs_forget_dev
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: avoid routing by TOS for real server · c90558da
      Committed by Julian Anastasov
      Avoid replacing the cached route for real server
      on every packet with different TOS. I doubt that routing
      by TOS for real server is used at all, so we should be
      better with such optimization.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  4. 19 Mar, 2013 2 commits
    • ipvs: add backup_only flag to avoid loops · 0c12582f
      Committed by Julian Anastasov
      Dmitry Akindinov reported a problem where SYNs loop
      between the master and backup server when the backup server is used as
      a real server in DR mode and has IPVS rules to function as director.
      
      Even when the backup function is enabled we continue to forward
      traffic and schedule new connections when the current master is using
      the backup server as real server. While this is not a problem for NAT,
      for DR and TUN method the backup server can not determine if a request
      comes from client or from director.
      
      To avoid such loops add new sysctl flag backup_only. It can be needed
      for DR/TUN setups that do not need backup and director function at the
      same time. When the backup function is enabled we stop any forwarding
      and pass the traffic to the local stack (real server mode). The flag
      disables the director function when the backup function is enabled.
      
      For setups that enable the backup function for some virtual services and
      the director function for other virtual services, a more complex
      solution would be needed to support DR/TUN mode, perhaps assigning a
      per-virtual-service syncid value, so that we can differentiate the
      requests.
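The rule described above (the flag disables the director function only while the backup function is enabled) can be written as a small predicate. This is a sketch with hypothetical names; `SYNC_STATE_BACKUP` and its bit value are assumptions made for the example, not taken from the kernel headers.

```c
#include <stdbool.h>

/* Hypothetical bit marking that the backup sync daemon is running. */
#define SYNC_STATE_BACKUP 0x2u

/* Returns true when packets should skip director processing and be
 * handed to the local stack (real server mode): the backup function
 * must be active AND the backup_only flag must be set. */
static bool director_disabled(unsigned int sync_state, bool backup_only)
{
    return (sync_state & SYNC_STATE_BACKUP) != 0 && backup_only;
}
```

With the flag cleared, or with the backup function stopped, the predicate is false and the host keeps forwarding as a director.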
      Reported-by: Dmitry Akindinov <dimak@stalker.com>
      Tested-by: German Myzovsky <lawyer@sipnet.ru>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: fix some sparse warnings · b962abdc
      Committed by Julian Anastasov
      Add missing __percpu annotations and make ip_vs_net_id static.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  5. 23 Oct, 2012 1 commit
  6. 28 Sep, 2012 5 commits
  7. 10 Aug, 2012 2 commits
  8. 17 Jul, 2012 1 commit
    • ipvs: fix oops on NAT reply in br_nf context · 9e33ce45
      Committed by Lin Ming
      IPVS should not reset skb->nf_bridge in FORWARD hook
      by calling nf_reset for NAT replies. It triggers oops in
      br_nf_forward_finish.
      
      [  579.781508] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
      [  579.781669] IP: [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112
      [  579.781792] PGD 218f9067 PUD 0
      [  579.781865] Oops: 0000 [#1] SMP
      [  579.781945] CPU 0
      [  579.781983] Modules linked in:
      [  579.782047]
      [  579.782080]
      [  579.782114] Pid: 4644, comm: qemu Tainted: G        W    3.5.0-rc5-00006-g95e69f9 #282 Hewlett-Packard  /30E8
      [  579.782300] RIP: 0010:[<ffffffff817b1ca5>]  [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112
      [  579.782455] RSP: 0018:ffff88007b003a98  EFLAGS: 00010287
      [  579.782541] RAX: 0000000000000008 RBX: ffff8800762ead00 RCX: 000000000001670a
      [  579.782653] RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff8800762ead00
      [  579.782845] RBP: ffff88007b003ac8 R08: 0000000000016630 R09: ffff88007b003a90
      [  579.782957] R10: ffff88007b0038e8 R11: ffff88002da37540 R12: ffff88002da01a02
      [  579.783066] R13: ffff88002da01a80 R14: ffff88002d83c000 R15: ffff88002d82a000
      [  579.783177] FS:  0000000000000000(0000) GS:ffff88007b000000(0063) knlGS:00000000f62d1b70
      [  579.783306] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
      [  579.783395] CR2: 0000000000000004 CR3: 00000000218fe000 CR4: 00000000000027f0
      [  579.783505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  579.783684] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  579.783795] Process qemu (pid: 4644, threadinfo ffff880021b20000, task ffff880021aba760)
      [  579.783919] Stack:
      [  579.783959]  ffff88007693cedc ffff8800762ead00 ffff88002da01a02 ffff8800762ead00
      [  579.784110]  ffff88002da01a02 ffff88002da01a80 ffff88007b003b18 ffffffff817b26c7
      [  579.784260]  ffff880080000000 ffffffff81ef59f0 ffff8800762ead00 ffffffff81ef58b0
      [  579.784477] Call Trace:
      [  579.784523]  <IRQ>
      [  579.784562]
      [  579.784603]  [<ffffffff817b26c7>] br_nf_forward_ip+0x275/0x2c8
      [  579.784707]  [<ffffffff81704b58>] nf_iterate+0x47/0x7d
      [  579.784797]  [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae
      [  579.784906]  [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102
      [  579.784995]  [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae
      [  579.785175]  [<ffffffff8187fa95>] ? _raw_write_unlock_bh+0x19/0x1b
      [  579.785179]  [<ffffffff817ac417>] __br_forward+0x97/0xa2
      [  579.785179]  [<ffffffff817ad366>] br_handle_frame_finish+0x1a6/0x257
      [  579.785179]  [<ffffffff817b2386>] br_nf_pre_routing_finish+0x26d/0x2cb
      [  579.785179]  [<ffffffff817b2cf0>] br_nf_pre_routing+0x55d/0x5c1
      [  579.785179]  [<ffffffff81704b58>] nf_iterate+0x47/0x7d
      [  579.785179]  [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44
      [  579.785179]  [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102
      [  579.785179]  [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44
      [  579.785179]  [<ffffffff81551525>] ? sky2_poll+0xb35/0xb54
      [  579.785179]  [<ffffffff817ad62a>] br_handle_frame+0x213/0x229
      [  579.785179]  [<ffffffff817ad417>] ? br_handle_frame_finish+0x257/0x257
      [  579.785179]  [<ffffffff816e3b47>] __netif_receive_skb+0x2b4/0x3f1
      [  579.785179]  [<ffffffff816e69fc>] process_backlog+0x99/0x1e2
      [  579.785179]  [<ffffffff816e6800>] net_rx_action+0xdf/0x242
      [  579.785179]  [<ffffffff8107e8a8>] __do_softirq+0xc1/0x1e0
      [  579.785179]  [<ffffffff8135a5ba>] ? trace_hardirqs_off_thunk+0x3a/0x6c
      [  579.785179]  [<ffffffff8188812c>] call_softirq+0x1c/0x30
      
      Steps to reproduce:

      1. On Host1, set up bridge br0 (192.168.1.106)
      2. Boot a kvm guest(192.168.1.105) on Host1 and start httpd
      3. Start IPVS service on Host1
         ipvsadm -A -t 192.168.1.106:80 -s rr
         ipvsadm -a -t 192.168.1.106:80 -r 192.168.1.105:80 -m
      4. Run apache benchmark on Host2(192.168.1.101)
         ab -n 1000 http://192.168.1.106/
      
      ip_vs_reply4
        ip_vs_out
          handle_response
            ip_vs_notrack
              nf_reset()
              {
                skb->nf_bridge = NULL;
              }
      
      Actually, IPVS wants in this case just to replace nfct
      with untracked version. So replace the nf_reset(skb) call
      in ip_vs_notrack() with a nf_conntrack_put(skb->nfct) call.
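The difference between the two approaches can be shown with a mock. This is a toy model only: `struct mock_skb` and both functions are hypothetical stand-ins, reference counting is omitted, and nothing here is the real sk_buff or netfilter API.

```c
#include <stddef.h>

/* Toy packet carrying only the two pointers relevant to this bug. */
struct mock_skb {
    void *nfct;       /* conntrack entry attached to the packet */
    void *nf_bridge;  /* bridge netfilter state, needed later by br_nf */
};

/* Old behaviour, modelled after the call chain above: a full reset
 * also clears the bridge state that br_nf_forward_finish dereferences. */
static void mock_nf_reset(struct mock_skb *skb)
{
    skb->nfct = NULL;
    skb->nf_bridge = NULL;
}

/* Fixed behaviour: only the conntrack pointer is swapped for the
 * untracked entry; the bridge state is left alone. */
static void mock_notrack_fixed(struct mock_skb *skb, void *untracked)
{
    skb->nfct = untracked;
}
```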
      Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  9. 09 May, 2012 3 commits
    • ipvs: add support for sync threads · f73181c8
      Committed by Pablo Neira Ayuso
      	Allow master and backup servers to use many threads
      for sync traffic. Add sysctl var "sync_ports" to define the
      number of threads. Every thread will use single UDP port,
      thread 0 will use the default port 8848 while last thread
      will use port 8848+sync_ports-1.
      
      	The sync traffic for connections is scheduled to many
      master threads based on the cp address but one connection is
      always assigned to same thread to avoid reordering of the
      sync messages.
      
      	Remove ip_vs_sync_switch_mode because this check
      for sync mode change is still risky. Instead, check for mode
      change under sync_buff_lock.
      
      	Make sure the backup socks do not block on reading.
      
      Special thanks to Aleksey Chudov for helping in all tests.
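The port assignment and the "one connection, one thread" rule can be sketched as follows. The port arithmetic comes from the description above; the selector is a hypothetical hash on the connection pointer value, chosen only to show that the same cp always maps to the same thread, and is not the kernel's actual formula.

```c
#include <stdint.h>

#define SYNC_BASE_PORT 8848u  /* default sync port named above */

/* UDP port used by sync thread id, for 0 <= id < sync_ports. */
static unsigned int sync_thread_port(unsigned int id)
{
    return SYNC_BASE_PORT + id;
}

/* Hypothetical selector: derive the thread index from the address of
 * the connection object. The low 6 bits are dropped because they vary
 * little between aligned objects. The result depends only on cp_addr,
 * so all sync messages for one connection use the same thread and are
 * not reordered. Requires sync_ports > 0. */
static unsigned int select_sync_thread(uintptr_t cp_addr, unsigned int sync_ports)
{
    return (unsigned int)((cp_addr >> 6) % sync_ports);
}
```

For example, with sync_ports set to 4 the threads would listen on ports 8848 through 8851.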
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: reduce sync rate with time thresholds · 749c42b6
      Committed by Julian Anastasov
      	Add two new sysctl vars to control the sync rate with the
      main idea to reduce the rate for connection templates because
      currently it depends on the packet rate for controlled connections.
      This mechanism should be useful also for normal connections
      with high traffic.
      
      sync_refresh_period: in seconds, difference in reported connection
      	timer that triggers new sync message. It can be used to
      	avoid sync messages for the specified period (or half of
      	the connection timeout if it is lower) if connection state
      	is not changed from last sync.
      
      sync_retries: integer, 0..3, defines sync retries with period of
      	sync_refresh_period/8. Useful to protect against loss of
      	sync messages.
      
      	Allow sysctl_sync_threshold to be used with
      sysctl_sync_period=0, so that only single sync message is sent
      if sync_refresh_period is also 0.
      
      	Add new field "sync_endtime" in connection structure to
      hold the reported time when connection expires. The 2 lowest
      bits will represent the retry count.
      
      	As the sysctl_sync_period now can be 0 use ACCESS_ONCE to
      avoid division by zero.
      
      	Special thanks to Aleksey Chudov for being patient with me,
      for his extensive reports and helping in all tests.
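Because sync_retries is limited to 0..3, the retry count fits in two bits, which is why it can share a word with the reported end time. The packing can be sketched like this; the helper names are hypothetical and the exact kernel expressions may differ.

```c
/* Store the retry count in the 2 lowest bits of the end time.
 * The end time loses its 2 lowest bits of precision in exchange. */
static unsigned long sync_endtime_pack(unsigned long endtime, unsigned int retries)
{
    return (endtime & ~3UL) | (retries & 3U);
}

/* Extract the retry count (0..3). */
static unsigned int sync_endtime_retries(unsigned long packed)
{
    return (unsigned int)(packed & 3UL);
}

/* Extract the end time with the retry bits masked out. */
static unsigned long sync_endtime_time(unsigned long packed)
{
    return packed & ~3UL;
}

/* Retries are spaced at sync_refresh_period / 8, as described above. */
static unsigned long sync_retry_period(unsigned long refresh_period)
{
    return refresh_period / 8;
}
```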
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: wakeup master thread · 1c003b15
      Committed by Pablo Neira Ayuso
      	High rate of sync messages in master can lead to
      overflowing the socket buffer and dropping the messages.
      A fixed sleep of 1 second without wakeup events is not suitable
      for loaded masters.
      
      	Use delayed_work to schedule sending for queued messages
      and limit the delay to IPVS_SYNC_SEND_DELAY (20ms). This will
      reduce the rate of wakeups but to avoid sending long bursts we
      wakeup the master thread after IPVS_SYNC_WAKEUP_RATE (8) messages.
      
      	Add hard limit for the queued messages before sending
      by using "sync_qlen_max" sysctl var. It defaults to 1/32 of
      the memory pages but actually represents number of messages.
      It will protect us from allocating large parts of memory
      when the sending rate is lower than the queuing rate.
      
      	As suggested by Pablo, add new sysctl var
      "sync_sock_size" to configure the SNDBUF (master) or
      RCVBUF (slave) socket limit. Default value is 0 (preserve
      system defaults).
      
      	Change the master thread to detect and block on
      SNDBUF overflow, so that we do not drop messages when
      the socket limit is low but the sync_qlen_max limit is
      not reached. On ENOBUFS or other errors just drop the
      messages.
      
      	Change master thread to enter TASK_INTERRUPTIBLE
      state early, so that we do not miss wakeups due to messages or
      kthread_should_stop event.
      
      Thanks to Pablo Neira Ayuso for his valuable feedback!
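The queuing policy can be modelled with a counter. This is a simplified sketch under stated assumptions: the type and function names are hypothetical, the real code works with delayed_work and kernel threads, and resetting the counter when the master thread sends is left out.

```c
#define SYNC_SEND_DELAY_MS 20u  /* upper bound for the delayed send */
#define SYNC_WAKEUP_RATE   8u   /* wake the master after this many messages */

enum enqueue_result {
    SYNC_DROPPED,      /* queue is at sync_qlen_max, message is dropped */
    SYNC_ARM_TIMER,    /* first message: schedule a send within 20 ms */
    SYNC_QUEUED,       /* queued, nothing else to do */
    SYNC_WAKE_MASTER   /* enough messages queued: wake the master now */
};

struct master_queue {
    unsigned int qlen;      /* messages currently queued */
    unsigned int qlen_max;  /* hard limit (sync_qlen_max) */
};

/* Account for one new sync message and report what the caller
 * should do about waking the master thread. */
static enum enqueue_result enqueue_sync_msg(struct master_queue *q)
{
    if (q->qlen >= q->qlen_max)
        return SYNC_DROPPED;

    q->qlen++;
    if (q->qlen == 1)
        return SYNC_ARM_TIMER;
    if (q->qlen == SYNC_WAKEUP_RATE)
        return SYNC_WAKE_MASTER;
    return SYNC_QUEUED;
}
```

The timer bounds latency for a lightly loaded master, while the explicit wakeup after 8 messages keeps bursts short under load.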
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
  10. 30 Apr, 2012 2 commits
  11. 21 Apr, 2012 1 commit
  12. 16 Apr, 2012 1 commit
  13. 05 Mar, 2012 1 commit
    • BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Committed by Paul Gortmaker
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  14. 31 Dec, 2011 1 commit
  15. 23 Nov, 2011 1 commit
  16. 01 Nov, 2011 2 commits