1. 09 Dec 2016 (2 commits)
  2. 04 Dec 2016 (1 commit)
  3. 03 Dec 2016 (1 commit)
  4. 01 Dec 2016 (1 commit)
  5. 30 Nov 2016 (1 commit)
    • mlx4: give precise rx/tx bytes/packets counters · 40931b85
      Authored by Eric Dumazet
      mlx4 stats are chaotic because a deferred work queue is responsible
      for updating them every 250 ms.
      
      Even sampling stats once per second with "sar -n DEV 1" gives
      variations like the following:
      
      lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65
      07:39:22         eth0 146877.00 3265554.00   9467.15 4828168.50
      07:39:23         eth0 146587.00 3260329.00   9448.15 4820445.98
      07:39:24         eth0 146894.00 3259989.00   9468.55 4819943.26
      07:39:25         eth0 110368.00 2454497.00   7113.95 3629012.17  <<>>
      07:39:26         eth0 146563.00 3257502.00   9447.25 4816266.23
      07:39:27         eth0 145678.00 3258292.00   9389.79 4817414.39
      07:39:28         eth0 145268.00 3253171.00   9363.85 4809852.46
      07:39:29         eth0 146439.00 3262185.00   9438.97 4823172.48
      07:39:30         eth0 146758.00 3264175.00   9459.94 4826124.13
      07:39:31         eth0 146843.00 3256903.00   9465.44 4815381.97
      Average:         eth0 142827.50 3179259.70   9206.30 4700578.16
      
      This patch folds the rx/tx bytes/packets counters at the time we
      need stats.
      
      We can now fetch stats every 1 ms if we want to check NIC behavior
      over a small time window. It also makes anomalies easier to detect.
      
      lpaa23:~# sar -n DEV 1 10 | grep eth0 | cut -c1-65
      07:42:50         eth0 142915.00 3177696.00   9212.06 4698270.42
      07:42:51         eth0 143741.00 3200232.00   9265.15 4731593.02
      07:42:52         eth0 142781.00 3171600.00   9202.92 4689260.16
      07:42:53         eth0 143835.00 3192932.00   9271.80 4720761.39
      07:42:54         eth0 141922.00 3165174.00   9147.64 4679759.21
      07:42:55         eth0 142993.00 3207038.00   9216.78 4741653.05
      07:42:56         eth0 141394.06 3154335.64   9113.85 4663731.73
      07:42:57         eth0 141850.00 3161202.00   9144.48 4673866.07
      07:42:58         eth0 143439.00 3180736.00   9246.05 4702755.35
      07:42:59         eth0 143501.00 3210992.00   9249.99 4747501.84
      Average:         eth0 142835.66 3182165.93   9206.98 4704874.08
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
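The folding described in the commit can be sketched outside the driver as follows; the struct layout and function names here are illustrative assumptions, not the actual mlx4 code. The point is that per-ring counters are summed at the moment stats are read, instead of being copied by a 250 ms deferred work queue:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-ring counters -- the names illustrate the idea,
 * not the actual mlx4 layout. Instead of a deferred work queue
 * copying these into the device stats every 250 ms, the counters are
 * folded on demand whenever stats are read, so readers always see
 * fresh values. */
struct ring_stats {
	unsigned long packets;
	unsigned long bytes;
};

struct dev_stats {
	unsigned long rx_packets, rx_bytes;
	unsigned long tx_packets, tx_bytes;
};

/* Fold all per-ring counters at the moment stats are requested. */
static void fold_stats(struct dev_stats *out,
		       const struct ring_stats *rx, size_t nrx,
		       const struct ring_stats *tx, size_t ntx)
{
	size_t i;

	out->rx_packets = out->rx_bytes = 0;
	out->tx_packets = out->tx_bytes = 0;
	for (i = 0; i < nrx; i++) {
		out->rx_packets += rx[i].packets;
		out->rx_bytes   += rx[i].bytes;
	}
	for (i = 0; i < ntx; i++) {
		out->tx_packets += tx[i].packets;
		out->tx_bytes   += tx[i].bytes;
	}
}
```

Because the fold runs in the reader's context, the sampling interval is whatever the reader chooses, which is why 1 ms sampling becomes meaningful.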
  6. 29 Nov 2016 (2 commits)
  7. 28 Nov 2016 (1 commit)
  8. 25 Nov 2016 (1 commit)
    • mlx4: reorganize struct mlx4_en_tx_ring · e3f42f84
      Authored by Eric Dumazet
      The goal is to reorganize this critical structure to increase
      performance.
      
      ndo_start_xmit() should only dirty one cache line, and access as few
      cache lines as possible.
      
      Add an sp_ (Slow Path) prefix to fields that are not used in the fast
      path, to make clear what is going on.
      
      After this patch, pahole reports a much better layout: all fields
      needed by ndo_start_xmit() are packed into two cache lines instead of
      seven or eight.
      
      struct mlx4_en_tx_ring {
      	u32                        last_nr_txbb;         /*     0   0x4 */
      	u32                        cons;                 /*   0x4   0x4 */
      	long unsigned int          wake_queue;           /*   0x8   0x8 */
      	struct netdev_queue *      tx_queue;             /*  0x10   0x8 */
      	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /*  0x18   0x8 */
      	struct mlx4_en_rx_ring *   recycle_ring;         /*  0x20   0x8 */
      
      	/* XXX 24 bytes hole, try to pack */
      
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	u32                        prod;                 /*  0x40   0x4 */
      	unsigned int               tx_dropped;           /*  0x44   0x4 */
      	long unsigned int          bytes;                /*  0x48   0x8 */
      	long unsigned int          packets;              /*  0x50   0x8 */
      	long unsigned int          tx_csum;              /*  0x58   0x8 */
      	long unsigned int          tso_packets;          /*  0x60   0x8 */
      	long unsigned int          xmit_more;            /*  0x68   0x8 */
      	struct mlx4_bf             bf;                   /*  0x70  0x18 */
      	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
      	__be32                     doorbell_qpn;         /*  0x88   0x4 */
      	__be32                     mr_key;               /*  0x8c   0x4 */
      	u32                        size;                 /*  0x90   0x4 */
      	u32                        size_mask;            /*  0x94   0x4 */
      	u32                        full_size;            /*  0x98   0x4 */
      	u32                        buf_size;             /*  0x9c   0x4 */
      	void *                     buf;                  /*  0xa0   0x8 */
      	struct mlx4_en_tx_info *   tx_info;              /*  0xa8   0x8 */
      	int                        qpn;                  /*  0xb0   0x4 */
      	u8                         queue_index;          /*  0xb4   0x1 */
      	bool                       bf_enabled;           /*  0xb5   0x1 */
      	bool                       bf_alloced;           /*  0xb6   0x1 */
      	u8                         hwtstamp_tx_type;     /*  0xb7   0x1 */
      	u8 *                       bounce_buf;           /*  0xb8   0x8 */
      	/* --- cacheline 3 boundary (192 bytes) --- */
      	long unsigned int          queue_stopped;        /*  0xc0   0x8 */
      	struct mlx4_hwq_resources  sp_wqres;             /*  0xc8  0x58 */
      	/* --- cacheline 4 boundary (256 bytes) was 32 bytes ago --- */
      	struct mlx4_qp             sp_qp;                /* 0x120  0x30 */
      	/* --- cacheline 5 boundary (320 bytes) was 16 bytes ago --- */
      	struct mlx4_qp_context     sp_context;           /* 0x150  0xf8 */
      	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
      	cpumask_t                  sp_affinity_mask;     /* 0x248  0x20 */
      	enum mlx4_qp_state         sp_qp_state;          /* 0x268   0x4 */
      	u16                        sp_stride;            /* 0x26c   0x2 */
      	u16                        sp_cqn;               /* 0x26e   0x2 */
      
      	/* size: 640, cachelines: 10, members: 36 */
      	/* sum members: 600, holes: 1, sum holes: 24 */
      	/* padding: 16 */
      };
      
      Instead of this silly placement:
      
      struct mlx4_en_tx_ring {
      	u32                        last_nr_txbb;         /*     0   0x4 */
      	u32                        cons;                 /*   0x4   0x4 */
      	long unsigned int          wake_queue;           /*   0x8   0x8 */
      
      	/* XXX 48 bytes hole, try to pack */
      
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	u32                        prod;                 /*  0x40   0x4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	long unsigned int          bytes;                /*  0x48   0x8 */
      	long unsigned int          packets;              /*  0x50   0x8 */
      	long unsigned int          tx_csum;              /*  0x58   0x8 */
      	long unsigned int          tso_packets;          /*  0x60   0x8 */
      	long unsigned int          xmit_more;            /*  0x68   0x8 */
      	unsigned int               tx_dropped;           /*  0x70   0x4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct mlx4_bf             bf;                   /*  0x78  0x18 */
      	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
      	long unsigned int          queue_stopped;        /*  0x90   0x8 */
      	cpumask_t                  affinity_mask;        /*  0x98  0x10 */
      	struct mlx4_qp             qp;                   /*  0xa8  0x30 */
      	/* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
      	struct mlx4_hwq_resources  wqres;                /*  0xd8  0x58 */
      	/* --- cacheline 4 boundary (256 bytes) was 48 bytes ago --- */
      	u32                        size;                 /* 0x130   0x4 */
      	u32                        size_mask;            /* 0x134   0x4 */
      	u16                        stride;               /* 0x138   0x2 */
      
      	/* XXX 2 bytes hole, try to pack */
      
      	u32                        full_size;            /* 0x13c   0x4 */
      	/* --- cacheline 5 boundary (320 bytes) --- */
      	u16                        cqn;                  /* 0x140   0x2 */
      
      	/* XXX 2 bytes hole, try to pack */
      
      	u32                        buf_size;             /* 0x144   0x4 */
      	__be32                     doorbell_qpn;         /* 0x148   0x4 */
      	__be32                     mr_key;               /* 0x14c   0x4 */
      	void *                     buf;                  /* 0x150   0x8 */
      	struct mlx4_en_tx_info *   tx_info;              /* 0x158   0x8 */
      	struct mlx4_en_rx_ring *   recycle_ring;         /* 0x160   0x8 */
      	u32                        (*free_tx_desc)(struct mlx4_en_priv *, struct mlx4_en_tx_ring *, int, u8, u64, int); /* 0x168   0x8 */
      	u8 *                       bounce_buf;           /* 0x170   0x8 */
      	struct mlx4_qp_context     context;              /* 0x178  0xf8 */
      	/* --- cacheline 9 boundary (576 bytes) was 48 bytes ago --- */
      	int                        qpn;                  /* 0x270   0x4 */
      	enum mlx4_qp_state         qp_state;             /* 0x274   0x4 */
      	u8                         queue_index;          /* 0x278   0x1 */
      	bool                       bf_enabled;           /* 0x279   0x1 */
      	bool                       bf_alloced;           /* 0x27a   0x1 */
      
      	/* XXX 5 bytes hole, try to pack */
      
      	/* --- cacheline 10 boundary (640 bytes) --- */
      	struct netdev_queue *      tx_queue;             /* 0x280   0x8 */
      	int                        hwtstamp_tx_type;     /* 0x288   0x4 */
      
      	/* size: 704, cachelines: 11, members: 36 */
      	/* sum members: 587, holes: 6, sum holes: 65 */
      	/* padding: 52 */
      };
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
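The reorganization above follows a general hot/cold-split pattern, sketched here with hypothetical field names (not the real mlx4 ones): fields touched by the transmit fast path are grouped at the front of the struct so they share the leading cache lines, while slow-path fields carry an sp_ prefix and move to the back.

```c
#include <assert.h>
#include <stddef.h>

/* A generic sketch of the hot/cold split described above, with
 * hypothetical field names (not the real mlx4 ones). Fields that
 * ndo_start_xmit() touches come first so they share the leading cache
 * lines; rarely-used fields carry the sp_ (slow path) prefix and sit
 * at the end of the struct. */
struct tx_ring {
	/* hot: dirtied or read on every transmitted packet */
	unsigned int prod;
	unsigned long packets;
	unsigned long bytes;
	void *buf;

	/* cold: setup/teardown and error paths only */
	unsigned long sp_queue_stopped;
	unsigned short sp_stride;
	unsigned short sp_cqn;
};
```

A layout like this can be inspected with `pahole -C tx_ring <object file>`, the same tool that produced the reports quoted above.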
  9. 24 Nov 2016 (1 commit)
  10. 22 Nov 2016 (1 commit)
    • mlx4: avoid unnecessary dirtying of critical fields · dad42c30
      Authored by Eric Dumazet
      While stressing a 40Gbit mlx4 NIC with busy polling, I found false
      sharing in the mlx4 driver that can easily be avoided.
      
      This patch brings an additional 7% performance improvement in the
      UDP_RR workload.
      
      1) If we received no frames during one mlx4_en_process_rx_cq()
         invocation, there is no need to call mlx4_cq_set_ci() and/or dirty
         ring->cons
      
      2) Do not refill rx buffers if we have plenty of them.
         This avoids false sharing and allows some bulk/batch optimizations.
         Page allocator and its locks will thank us.
      
      Finally, mlx4_en_poll_rx_cq() should not return 0 if it determined
      that the CPU handling the NIC IRQ should be changed. We should return
      budget - 1 instead, so as not to fool net_rx_action() and its
      netdev_budget.
      
      v2: keep AVG_PERF_COUNTER(... polled) even if polled is 0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
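The budget rule in the last paragraph of the commit can be condensed into a small decision function. This is a simplified model with made-up names, not the driver's mlx4_en_poll_rx_cq():

```c
#include <assert.h>

/* Simplified model of the polling rule above; names are made up and
 * this is not the driver's mlx4_en_poll_rx_cq(). When the handler
 * decides the NIC IRQ should move to another CPU, it must not return
 * 0 -- net_rx_action() would read that as "nothing to do" and charge
 * nothing against netdev_budget. Returning budget - 1 keeps the
 * accounting honest without claiming the full budget. */
static int poll_rx_cq(int done, int budget, int irq_should_move)
{
	if (done < budget) {
		if (irq_should_move)
			return budget - 1; /* never 0 in this case */
		return done;               /* normal completion */
	}
	return budget;                     /* budget exhausted, stay scheduled */
}
```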
  11. 18 Nov 2016 (1 commit)
  12. 17 Nov 2016 (1 commit)
  13. 13 Nov 2016 (1 commit)
  14. 10 Nov 2016 (1 commit)
  15. 03 Nov 2016 (3 commits)
  16. 30 Oct 2016 (11 commits)
  17. 18 Oct 2016 (1 commit)
  18. 14 Oct 2016 (1 commit)
    • net/mlx4_en: fixup xdp tx irq to match rx · 958b3d39
      Authored by Brenden Blanco
      In cases where the number of tx rings is not a multiple of the number
      of rx rings, the tx completion event will be handled on a different
      core from the one that transmitted on and populated the ring. Races on
      the ring will lead to a double free of the page, and possibly other
      corruption.
      
      The rings are initialized by default with a valid multiple of rings,
      based on the number of CPUs, so an invalid configuration requires
      ethtool to change the ring layout. For instance, 'ethtool -L eth0 rx 9
      tx 8' will cause packets received on rx0, and XDP_TX'd to tx48, to be
      completed on cpu3 (48 % 9 == 3).
      
      Resolve this discrepancy by shifting the irq for the xdp tx queues to
      start again from 0, modulo rx_ring_num.
      
      Fixes: 9ecc2d86 ("net/mlx4_en: add xdp forwarding and data write support")
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
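The arithmetic in the example above (48 % 9 == 3) and the shape of the fix can be modeled as follows; both helper names are hypothetical, not the driver's code:

```c
#include <assert.h>

/* A model of the completion-vector arithmetic above; both helpers are
 * hypothetical, not the driver's code. Before the fix, an XDP tx
 * ring's completion vector was derived from its global tx index, so
 * rx0's XDP_TX traffic on tx48 completed on vector 48 % 9 == 3.
 * After the fix, the XDP tx queues are numbered again from 0 before
 * taking the modulo, so each one completes on the vector of the rx
 * ring that feeds it. */
static int cq_vector_before(int global_tx_index, int rx_ring_num)
{
	return global_tx_index % rx_ring_num;
}

static int cq_vector_after(int xdp_tx_index, int rx_ring_num)
{
	return xdp_tx_index % rx_ring_num;
}
```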
  19. 08 Oct 2016 (1 commit)
    • IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets · fd10ed8e
      Authored by Jack Morgenstein
      In MLX qp packets, the LRH (built by the driver) has both a VL field
      and an SL field. When building a QP1 packet, the VL field should
      reflect the SLtoVL mapping and not arbitrarily contain zero (as is
      done now). This bug causes credit problems in IB switches at
      high rates of QP1 packets.
      
      The fix is to cache the SL to VL mapping in the driver, and look up
      the VL mapped to the SL provided in the send request when sending
      QP1 packets.
      
      For FW versions which support generating a port_management_config_change
      event with subtype sl-to-vl-table-change, the driver uses that event
      to update its sl-to-vl mapping cache.  Otherwise, the driver snoops
      incoming SMP mads to update the cache.
      
      There remains the case where the FW is running in secure-host mode
      (so no QP0 packets are delivered to the driver) and the FW does not
      generate the sl2vl mapping change event. To support this case, the
      driver updates (by querying the FW) its sl2vl mapping cache when
      running in secure-host mode upon receiving either a Port Up event
      or a client-reregister event (where the port is still up, but there
      may have been an OpenSM failover).
      OpenSM modifies the sl2vl mapping before Port Up and Client-reregister
      events occur, so if there is a mapping change the driver's cache will
      be properly updated.
      
      Fixes: 225c7b1f ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
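One way to cache an SL-to-VL mapping as the commit describes is sketched below. The layout is an assumption for illustration, not the driver's actual data structure: sixteen SLs at 4 bits of VL each pack into one 64-bit word, so an updater can publish a whole new mapping atomically while the QP1 send path reads it locklessly and picks the VL for the request's SL instead of hardcoding zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout for an SL-to-VL cache, for illustration only (not
 * the driver's actual data structure). Sixteen SLs at 4 bits of VL
 * each pack into one 64-bit word, so the event/MAD-snooping path can
 * publish a whole new mapping atomically while the QP1 send path
 * reads it locklessly and picks the VL for the request's SL. */
static _Atomic uint64_t sl2vl_cache;

/* Updater side: rebuild and publish the whole mapping at once. */
static void sl2vl_update(const uint8_t vl_for_sl[16])
{
	uint64_t map = 0;
	int sl;

	for (sl = 0; sl < 16; sl++)
		map |= (uint64_t)(vl_for_sl[sl] & 0xf) << (4 * sl);
	atomic_store(&sl2vl_cache, map);
}

/* Send side: one atomic load, then a shift-and-mask lookup. */
static uint8_t sl2vl_lookup(uint8_t sl)
{
	uint64_t map = atomic_load(&sl2vl_cache);

	return (map >> (4 * (sl & 0xf))) & 0xf;
}
```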
  20. 30 Sep 2016 (1 commit)
    • mlx4: remove unused fields · 5038056e
      Authored by David Decotigny
      This also addresses the following UBSAN warnings:
      [   36.640343] ================================================================================
      [   36.648772] UBSAN: Undefined behaviour in drivers/net/ethernet/mellanox/mlx4/fw.c:857:26
      [   36.656853] shift exponent 64 is too large for 32-bit type 'int'
      [   36.663348] ================================================================================
      [   36.671783] ================================================================================
      [   36.680213] UBSAN: Undefined behaviour in drivers/net/ethernet/mellanox/mlx4/fw.c:861:27
      [   36.688297] shift exponent 35 is too large for 32-bit type 'int'
      [   36.694702] ================================================================================
      
      Tested:
        reboot with UBSAN, no warning.
      Signed-off-by: David Decotigny <decot@googlers.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
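The warnings above come from shifting a 32-bit int by 35 or 64 bits, which is undefined behaviour in C. A sketch of the usual remedy, using a hypothetical helper rather than the actual fw.c change, is to widen the operand before shifting and handle the full-width case explicitly:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the actual fw.c code: build a mask of the
 * low `bits` bits without undefined behaviour. Shifting a 32-bit int
 * by 32 or more (the UBSAN reports above show 35 and 64) is UB, so
 * widen to unsigned 64-bit first and special-case the full width. */
static uint64_t field_mask(unsigned int bits)
{
	if (bits >= 64)                /* 1ULL << 64 would still be UB */
		return UINT64_MAX;
	return (1ULL << bits) - 1;     /* safe for bits 0..63 */
}
```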
  21. 24 Sep 2016 (5 commits)
  22. 22 Sep 2016 (1 commit)