1. 01 3月, 2015 1 次提交
  2. 25 2月, 2015 1 次提交
  3. 23 2月, 2015 1 次提交
  4. 22 2月, 2015 1 次提交
  5. 21 2月, 2015 2 次提交
  6. 20 2月, 2015 1 次提交
  7. 16 2月, 2015 1 次提交
  8. 15 2月, 2015 2 次提交
  9. 14 2月, 2015 1 次提交
  10. 12 2月, 2015 1 次提交
  11. 09 2月, 2015 2 次提交
    • E
      net:rfs: adjust table size checking · 93c1af6c
      Eric Dumazet 提交于
      Make sure root user does not try something stupid.
      
      Also make sure mask field in struct rps_sock_flow_table
      does not share a cache line with the potentially often dirtied
      flow table.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93c1af6c
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
  12. 08 2月, 2015 2 次提交
  13. 06 2月, 2015 2 次提交
  14. 05 2月, 2015 3 次提交
    • E
      net: remove some sparse warnings · 2ce1ee17
      Eric Dumazet 提交于
      netdev_adjacent_add_links() and netdev_adjacent_del_links()
      are static.
      
      queue->qdisc has __rcu annotation, need to use RCU_INIT_POINTER()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ce1ee17
    • M
      net/core: Add event for a change in slave state · 61bd3857
      Moni Shoua 提交于
      Add event which provides an indication on a change in the state
      of a bonding slave. The event handler should cast the pointer to the
      appropriate type (struct netdev_bonding_info) in order to get the
      full info about the slave.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61bd3857
    • E
      xps: fix xps for stacked devices · 2bd82484
      Eric Dumazet 提交于
      A typical qdisc setup is the following :
      
      bond0 : bonding device, using HTB hierarchy
      eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc
      
      XPS allows to spread packets on specific tx queues, based on the cpu
      doing the send.
      
      Problem is that dequeues from bond0 qdisc can happen on random cpus,
      due to the fact that qdisc_run() can dequeue a batch of packets.
      
      CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
      CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
      
      CPUB -> dequeue packet P1 from bond0
              enqueue packet on eth1/eth2
      CPUC -> dequeue packet P2 from bond0
              enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
      
      get_xps_queue() then might select wrong queue for P1, since current cpu
      might be different than CPUA.
      
      P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
      if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)
      
      Effect of this bug is TCP reorders, and more generally not optimal
      TX queue placement. (A victim bulk flow can be migrated to the wrong TX
      queue for a while)
      
      To fix this, we have to record sender cpu number the first time
      dev_queue_xmit() is called for one tx skb.
      
      We can union napi_id (used on receive path) and sender_cpu,
      granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
      this union idea)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bd82484
  15. 04 2月, 2015 1 次提交
  16. 03 2月, 2015 2 次提交
    • W
      net-timestamp: no-payload only sysctl · b245be1f
      Willem de Bruijn 提交于
      Tx timestamps are looped onto the error queue on top of an skb. This
      mechanism leaks packet headers to processes unless the no-payload
      options SOF_TIMESTAMPING_OPT_TSONLY is set.
      
      Add a sysctl that optionally drops looped timestamp with data. This
      only affects processes without CAP_NET_RAW.
      
      The policy is checked when timestamps are generated in the stack.
      It is possible for timestamps with data to be reported after the
      sysctl is set, if these were queued internally earlier.
      
      No vulnerability is immediately known that exploits knowledge
      gleaned from packet headers, but it may still be preferable to allow
      administrators to lock down this path at the cost of possible
      breakage of legacy applications.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        (v1 -> v2)
        - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
        (rfc -> v1)
        - document the sysctl in Documentation/sysctl/net.txt
        - fix access control race: read .._OPT_TSONLY only once,
              use same value for permission check and skb generation.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b245be1f
    • W
      net-timestamp: no-payload option · 49ca0d8b
      Willem de Bruijn 提交于
      Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
      timestamps, this loops timestamps on top of empty packets.
      
      Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
      cmsg reception (aside from timestamps) are no longer possible. This
      works together with a follow on patch that allows administrators to
      only allow tx timestamping if it does not loop payload or metadata.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes (rfc -> v1)
        - add documentation
        - remove unnecessary skb->len test (thanks to Richard Cochran)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49ca0d8b
  17. 02 2月, 2015 1 次提交
  18. 31 1月, 2015 1 次提交
  19. 30 1月, 2015 2 次提交
  20. 29 1月, 2015 1 次提交
  21. 27 1月, 2015 1 次提交
  22. 24 1月, 2015 2 次提交
  23. 23 1月, 2015 1 次提交
  24. 20 1月, 2015 3 次提交
  25. 19 1月, 2015 2 次提交
  26. 18 1月, 2015 2 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
    • R
      bridge: fix setlink/dellink notifications · 02dba438
      Roopa Prabhu 提交于
      problems with bridge getlink/setlink notifications today:
              - bridge setlink generates two notifications to userspace
                      - one from the bridge driver
                      - one from rtnetlink.c (rtnl_bridge_notify)
              - dellink generates one notification from rtnetlink.c. Which
      	means bridge setlink and dellink notifications are not
      	consistent
      
              - Looking at the code it appears,
      	If both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF were set,
              the size calculation in rtnl_bridge_notify can be wrong.
              Example: if you set both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF
              in a setlink request to rocker dev, rtnl_bridge_notify will
      	allocate skb for one set of bridge attributes, but,
      	both the bridge driver and rocker dev will try to add
      	attributes resulting in twice the number of attributes
      	being added to the skb.  (rocker dev calls ndo_dflt_bridge_getlink)
      
      There are multiple options:
      1) Generate one notification including all attributes from master and self:
         But, I don't think it will work, because both master and self may use
         the same attributes/policy. Cannot pack the same set of attributes in a
         single notification from both master and slave (duplicate attributes).
      
      2) Generate one notification from master and the other notification from
         self (This seems to be ideal):
           For master: the master driver will send notification (bridge in this
      	example)
           For self: the self driver will send notification (rocker in the above
      	example. It can use helpers from rtnetlink.c to do so. Like the
      	ndo_dflt_bridge_getlink api).
      
      This patch implements 2) (leaving the 'rtnl_bridge_notify' around to be used
      with 'self').
      
      v1->v2 :
      	- rtnl_bridge_notify is now called only for self,
      	so, remove 'BRIDGE_FLAGS_SELF' check and cleanup a few things
      	- rtnl_bridge_dellink used to always send a RTM_NEWLINK msg
      	earlier. So, I have changed the notification from br_dellink to
      	go as RTM_NEWLINK
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02dba438