1. 18 Nov, 2016 12 commits
    • A
      netns: make struct pernet_operations::id unsigned int · c7d03a00
      Committed by Alexey Dobriyan
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and
      is thus an unsigned entity. Using a negative value is an
      out-of-bounds access by definition.
      
      2)
      On x86_64, unsigned 32-bit data that are mixed with pointers
      via array indexing, or as offsets added to or subtracted from
      pointers, are preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction that is unnecessary if the variable is
      unsigned, because x86_64 zero-extends 32-bit values by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
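The difference between the two index types can be sketched in user-space C. This is only an illustration: `struct ptr_table`, `lookup_signed()` and `lookup_unsigned()` are hypothetical names, not kernel code. Both functions return the same slot for a valid id; only the extension the compiler has to perform on the index differs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-namespace pointer array. */
struct ptr_table {
	size_t len;
	void *ptr[8];
};

/* Signed id: the index must be sign-extended to 64 bits before use. */
static void *lookup_signed(const struct ptr_table *t, int id)
{
	assert(id >= 1 && (size_t)id <= t->len);
	return t->ptr[id - 1];
}

/* Unsigned id: the index only needs zero extension. */
static void *lookup_unsigned(const struct ptr_table *t, unsigned int id)
{
	assert(id >= 1 && id <= t->len);
	return t->ptr[id - 1];
}
```

Whether a particular compiler actually emits fewer bytes for the unsigned variant depends on its version and flags; comparing the output of `gcc -O2 -S` for the two functions is the way to check.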
      
      The patch shaves ~1730 bytes off an allyesconfig kernel (built without
      the options that distort code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately, some functions actually grow.
      This is a seemingly random artefact of code generation, with the
      register allocator making different choices. gcc decides that some
      variable needs to live in one of the new r8+ registers, so every access
      now requires a REX prefix. Or the variable is moved into r12, so the
      [r12+0] addressing mode has to be used, which is longer than [r8].
      
      However, the overall balance is negative:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      udp: enable busy polling for all sockets · e68b6e50
      Committed by Eric Dumazet
      UDP busy polling is restricted to connected UDP sockets.
      
      This is because sk_busy_loop() only takes care of one NAPI context.
      
      There are cases where it could be extended.
      
      1) Some hosts receive traffic on a single NIC, with one RX queue.
      
      2) Some applications use SO_REUSEPORT and associated BPF filter
         to split the incoming traffic on one UDP socket per RX
         queue/thread/cpu
      
      3) Some UDP sockets are used to send/receive traffic for one flow, but
         they do not bother with connect()
      
      This patch records the napi_id of first received skb, giving more
      reach to busy polling.
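The "record the napi_id of the first received skb" idea can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel implementation: `struct pkt`, `struct usock`, `mark_napi_id_once()` and `can_busy_poll()` are hypothetical names, and an id of 0 is assumed to mean "not recorded yet".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a received packet and a socket. */
struct pkt { unsigned int napi_id; };
struct usock { unsigned int napi_id; };	/* 0 means "not recorded yet" */

/* Record the NAPI id of the first packet only; later packets never overwrite it. */
static void mark_napi_id_once(struct usock *sk, const struct pkt *p)
{
	assert(sk != NULL && p != NULL);
	if (sk->napi_id == 0)
		sk->napi_id = p->napi_id;
}

/* Busy polling needs a known NAPI context to poll. */
static bool can_busy_poll(const struct usock *sk)
{
	return sk->napi_id != 0;
}
```

With this model an unconnected socket becomes eligible for busy polling as soon as it has received one packet, which matches the three cases listed above where all traffic arrives through a single NAPI context.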
      
      Tested:
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
      
      Before patch :
         27867   28870   37324   41060   41215
         36764   36838   44455   41282   43843
      After patch :
         73920   73213   70147   74845   71697
         68315   68028   75219   70082   73707
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'rds-ha-failover-fixes' · fcd2b0da
      Committed by David S. Miller
      Sowmini Varadhan says:
      
      ====================
      RDS: TCP: HA/Failover fixes
      
      This series contains a set of fixes for bugs exposed when
      we ran the following in a loop between a test machine pair:
      
       while true; do
         modprobe rds-tcp      # on test nodes
         # run rds-stress in bi-dir mode between test machine pair
         modprobe -r rds-tcp   # on test nodes
       done
      
      rds-stress in bi-dir mode will cause both nodes to initiate
      RDS-TCP connections at almost the same instant, exposing the
      bugs fixed in this series.
      
      Without the fixes, rds-stress reports sporadic packet drops,
      and packets arriving out of sequence. After the fixes, we have
      been able to run the test overnight without any issues.
      
      Each patch has a detailed description of the root-cause fixed
      by the patch.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      RDS: TCP: Force every connection to be initiated by numerically smaller IP address · 1a0e100f
      Committed by Sowmini Varadhan
      When 2 RDS peers initiate an RDS-TCP connection simultaneously,
      there is a potential for "duelling SYNs" on either/both sides.
      See commit 241b2719 ("RDS-TCP: Reset tcp callbacks if re-using an
      outgoing socket in rds_tcp_accept_one()") for a description of this
      condition, and the arbitration logic which ensures that the
      numerically larger IP address in the TCP connection is bound to the
      RDS_TCP_PORT ("canonical ordering").
      
      The rds_connection should not be marked as RDS_CONN_UP until the
      arbitration logic has converged, for the following reason. The sender
      may start transmitting RDS datagrams as soon as RDS_CONN_UP is set,
      and it removes datagrams from the rds_connection's cp_retrans queue
      based on TCP acks. If the TCP ack was sent from a TCP socket that got
      reset as part of duel arbitration (but before data was delivered to
      the receiver's RDS socket layer), the sender may end up prematurely
      freeing the datagram, and the datagram is no longer reliably
      deliverable.
      
      This patch remedies that condition by making sure that, upon
      receipt of the TCP_ESTABLISHED state change notification (3WH
      completion) in rds_tcp_state_change(), we mark the rds_connection as
      RDS_CONN_UP if, and only if, the IP addresses and ports for the
      connection are canonically ordered. In all other cases,
      rds_tcp_state_change() will force an rds_conn_path_drop(), and
      rds_queue_reconnect() on both peers will restart the connection to
      ensure canonical ordering.
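The canonical-ordering rule can be illustrated with a small user-space model. It is a sketch under stated assumptions, not the RDS code: addresses are plain host-order 32-bit integers, ports are ignored, the two peers are assumed to have distinct addresses, and `is_canonical_initiator()`, `on_established()` and the `conn_action` values are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes for an outgoing connection that reached ESTABLISHED. */
enum conn_action { CONN_MARK_UP, CONN_DROP_AND_RECONNECT };

/* The numerically smaller address is the one that should initiate. */
static bool is_canonical_initiator(uint32_t local_addr, uint32_t peer_addr)
{
	assert(local_addr != peer_addr);	/* model assumes two distinct peers */
	return local_addr < peer_addr;
}

/* Only a canonically ordered connection may be marked up. */
static enum conn_action on_established(uint32_t local_addr, uint32_t peer_addr)
{
	return is_canonical_initiator(local_addr, peer_addr)
		? CONN_MARK_UP
		: CONN_DROP_AND_RECONNECT;
}
```

Because both peers evaluate the same comparison on the same pair of addresses, they reach complementary decisions, so exactly one direction of a pair of duelling connections survives.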
      
      A side-effect of enforcing this condition in rds_tcp_state_change()
      is that rds_tcp_accept_one_path() can now be refactored for simplicity.
      It is also no longer possible to encounter an RDS_CONN_UP connection in
      the arbitration logic in rds_tcp_accept_one().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      RDS: TCP: Track peer's connection generation number · 905dd418
      Committed by Sowmini Varadhan
      The RDS transport has to be able to distinguish between
      two types of failure events:
      (a) when the transport fails (e.g., TCP connection reset)
          but the RDS socket/connection layer on both sides stays
          the same
      (b) when the peer's RDS layer itself resets (e.g., due to module
          reload or machine reboot at the peer)
      In case (a) both sides must reconnect and continue the RDS messaging
      without any message loss or disruption to the message sequence numbers,
      and this is achieved by rds_send_path_reset().
      
      In case (b) we should reset all rds_connection state to the
      new incarnation of the peer. Examples of state that needs to
      be reset are next expected rx sequence number from, or messages to be
      retransmitted to, the new incarnation of the peer.
      
      To achieve this, the RDS handshake probe added as part of
      commit 5916e2c1 ("RDS: TCP: Enable multipath RDS for TCP")
      is enhanced so that sender and receiver of the RDS ping-probe
      will add a generation number as part of the RDS_EXTHDR_GEN_NUM
      extension header. Each peer stores local and remote generation
      numbers as part of each rds_connection. Changes in generation
      number will be detected via incoming handshake probe ping
      request or response and will allow the receiver to reset rds_connection
      state.
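A minimal model of the generation-number check follows. All names (`struct conn_state`, `handle_peer_gen()`) and the choice of which fields to clear are assumptions made for illustration; a generation number of 0 is assumed to mean "unknown".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state kept about the peer. */
struct conn_state {
	uint32_t peer_gen_num;		/* 0 = not learned yet */
	uint64_t next_rx_seq;		/* next expected rx sequence number */
	unsigned int retrans_queued;	/* messages awaiting retransmission */
};

/*
 * Process the generation number carried by a handshake probe.
 * Returns true when the peer turned out to be a new incarnation
 * and the connection state was reset.
 */
static bool handle_peer_gen(struct conn_state *c, uint32_t gen)
{
	bool reset = false;

	assert(c != NULL);
	if (gen == 0)
		return false;	/* probe carried no generation number */

	if (c->peer_gen_num != 0 && c->peer_gen_num != gen) {
		/* Case (b): the peer's RDS layer restarted. */
		c->next_rx_seq = 0;
		c->retrans_queued = 0;
		reset = true;
	}
	c->peer_gen_num = gen;
	return reset;
}
```

A plain transport failure, case (a), leaves the generation number unchanged, so the function returns false and the sequence state is preserved.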
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      RDS: TCP: set RDS_FLAG_RETRANSMITTED in cp_retrans list · 315ca6d9
      Committed by Sowmini Varadhan
      As noted in rds_recv_incoming(), sequence numbers on data packets
      can decrease in the failover case, and the Rx path is equipped
      to recover from this if RDS_FLAG_RETRANSMITTED is set
      on the RDS header of an incoming message with a suspect sequence
      number.
      
      Setting RDS_FLAG_RETRANSMITTED in the header is predicated on the
      RDS_MSG_RETRANSMITTED flag in the rds_message, so make sure the flag
      is set on messages queued for retransmission.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • L
      net: stmmac: replace if (netif_msg_type) by their netif_xxx counterpart · b3e51069
      Committed by LABBE Corentin
      As suggested by Joe Perches, we could replace all
      if (netif_msg_type(priv)) dev_xxx(priv->devices, ...)
      with the simpler macro netif_xxx(priv, hw, priv->dev, ...)
      Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • L
      net: stmmac: replace hardcoded function name by __func__ · de9a2165
      Committed by LABBE Corentin
      Some print statements have the function name hardcoded.
      It is better to use __func__ instead.
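A user-space sketch of the pattern: `LOG_TO()` and `stmmac_demo_open()` are hypothetical names used only to show that `__func__` expands to the name of the enclosing function, so the message stays correct if the function is renamed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats "<function>: <msg>" into buf. */
#define LOG_TO(buf, len, msg) \
	snprintf((buf), (len), "%s: %s", __func__, (msg))

static int stmmac_demo_open(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	/* Before: snprintf(buf, len, "stmmac_demo_open: %s", "link up"); */
	return LOG_TO(buf, len, "link up");
}
```

The buffer ends up holding the function name followed by the message, without the name being spelled out in the format string.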
      Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • L
      net: stmmac: replace all pr_xxx by their netdev_xxx counterpart · 38ddc59d
      Committed by LABBE Corentin
      The stmmac driver uses lots of pr_xxx functions to print information.
      This is bad since we cannot know which device logged the information
      (all the more so if two stmmac devices are present).
      
      Furthermore, it seems to wrongly assume that log lines will always be
      consecutive, by using a dev_xxx followed by some indented pr_xxx like this:
      kernel: sun7i-dwmac 1c50000.ethernet: no reset control found
      kernel:  Ring mode enabled
      kernel:  No HW DMA feature register supported
      kernel:  Normal descriptors
      kernel:  TX Checksum insertion supported
      
      So this patch replaces all pr_xxx by their netdev_xxx counterpart,
      except for some messages where netdev_xxx produces ugly output like:
      sun7i-dwmac 1c50000.ethernet (unnamed net_device) (uninitialized): no reset control found
      In those cases, I keep dev_xxx.
      
      At the same time, I remove some "stmmac:" prefixes since
      they would duplicate what dev_xxx already displays.
      Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net_sched: sch_fq: use hash_ptr() · 29c58472
      Committed by Eric Dumazet
      When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
      and I chose hash_32().
      
      Linus Torvalds and George Spelvin fixed this issue, so we can
      use hash_ptr() to get more entropy on 64bit arches with Terabytes
      of memory, and avoid the cast games.
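Why hashing the full pointer gives more entropy can be sketched in user space. The helpers `hash32()` and `hash64()` are hypothetical multiplicative hashes written for this illustration; the golden-ratio constants are meant to mirror those in the kernel's include/linux/hash.h and should be treated as assumptions. Truncating a 64-bit pointer to 32 bits before hashing discards the upper half, so two pointers that differ only there always collide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the kernel's multiplicative hash constants. */
#define GOLDEN_RATIO_32 0x61C88647u
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Hash a 32-bit value down to `bits` bits (high bits of the product). */
static uint32_t hash32(uint32_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* Hash a full 64-bit value down to `bits` bits. */
static uint32_t hash64(uint64_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)((val * GOLDEN_RATIO_64) >> (64 - bits));
}
```

For example, the pointer values 0x0000000100001000 and 0x0000000200001000 share their low 32 bits, so `hash32()` on the truncated values returns the same bucket for both, while `hash64()` on the full values separates them.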
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net/mlx5e: remove napi_hash_del() calls · d30d9ccb
      Committed by Eric Dumazet
      Calling napi_hash_del() after netif_napi_del() is pointless.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net/mlx4_en: remove napi_hash_del() call · bb07fafa
      Committed by Eric Dumazet
      There is no need to call napi_hash_del()+synchronize_rcu() before
      calling netif_napi_del().
      
      netif_napi_del() does this already.
      
      Using napi_hash_del() in a driver is useful only when dealing with
      a batch of NAPI structures, so that a single synchronize_rcu() can
      be used. mlx4_en_deactivate_cq() is deactivating a single NAPI.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 17 Nov, 2016 28 commits