1. 25 9月, 2015 22 次提交
    • D
      Merge branch 'phy-mdio-refcnt' · b626ef01
      David S. Miller 提交于
      Russell King says:
      
      ====================
      Phy, mdiobus, and netdev struct device fixes
      
      The third version of this series fixes the build error which David
      identified, and drops the broken changes for the Cavium Thunger BGX
      ethernet driver as this driver requires some complex changes to
      resolve the leakage - and this is best done by people who can test
      the driver.
      
      Compared to v2, the only patch which has changed is patch 6
        "net: fix phy refcounting in a bunch of drivers"
      
      I _think_ I've been able to build-test all the drivers touched by
      that patch to some degree now, though several of them needed the
      Kconfig hacked to allow it (not all had || COMPILE_TEST clause on
      their dependencies.)
      
      Previous cover letters below:
      
      This is the second version of the series, with the comments David had
      on the first patch fixed up.  Original series description with updated
      diffstat below.
      
      While looking at the DSA code, I noticed we have a
      of_find_net_device_by_node(), and it looks like users of that are
      similarly buggy - it looks like net/dsa/dsa.c is the only user.  Fix
      that too.
      
      Hi,
      
      While looking at the phy code, I identified a number of weaknesses
      where refcounting on device structures was being leaked, where
      modules could be removed while in-use, and where the fixed-phy could
      end up having unintended consequences caused by incorrect calls to
      fixed_phy_update_state().
      
      This patch series resolves those issues, some of which were discovered
      with testing on an Armada 388 board.  Not all patches are fully tested,
      particularly the one which touches several network drivers.
      
      When resolving the struct device refcounting problems, several different
      solutions were considered before settling on the implementation here -
      one of the considerations was to avoid touching many network drivers.
      The solution here is:
      
      	phy_attach*() - takes a refcount
      	phy_detach*() - drops the phy_attach refcount
      
      Provided drivers always attach and detach their phys, which they should
      already be doing, this should change nothing, even if they leak a refcount.
      
      	of_phy_find_device() and of_* functions which use that take
      	a refcount.  Arrange for this refcount to be dropped once
      	the phy is attached.
      
      This is the reason why the previous change is important - we can't drop
      this refcount taken by of_phy_find_device() until something else holds
      a reference on the device.  This resolves the leaked refcount caused by
      using of_phy_connect() or of_phy_attach().
      
      Even without the above changes, these drivers are leaking by calling
      of_phy_find_device().  These drivers are addressed by adding the
      appropriate release of that refcount.
      
      The mdiobus code also suffered from the same kind of leak, but thankfully
      this only happened in one place - the mdio-mux code.
      
      I also found that the try_module_get() in the phy layer code was utterly
      useless: phydev->dev.driver was guaranteed to always be NULL, so
      try_module_get() was always being called with a NULL argument.  I proved
      this with my SFP code, which declares its own MDIO bus - the module use
      count was never incremented irrespective of how I set the MDIO bus up.
      This allowed the MDIO bus code to be removed from the kernel while there
      were still PHYs attached to it.
      
      One other bug was discovered: while using in-band-status with mvneta, it
      was found that if a real phy is attached with in-band-status enabled,
      and another ethernet interface is using the fixed-phy infrastructure, the
      interface using the fixed-phy infrastructure is configured according to
      the other interface using the in-band-status - which is caused by the
      fixed-phy code not verifying that the phy_device passed in is actually
      a fixed-phy device, rather than a real MDIO phy.
      
      Lastly, having mdio_bus reversing phy_device_register() internals seems
      like a layering violation - it's trivial to move that code to the phy
      device layer.
      ====================
      Tested-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b626ef01
    • R
      net: fix net_device refcounting · 9861f720
      Russell King 提交于
      of_find_net_device_by_node() uses class_find_device() internally to
      lookup the corresponding network device.  class_find_device() returns
      a reference to the embedded struct device, with its refcount
      incremented.
      
      Add a comment to the definition in net/core/net-sysfs.c indicating the
      need to drop this refcount, and fix the DSA code to drop this refcount
      when the OF-generated platform data is cleaned up and freed.  Also
      arrange for the ref to be dropped when handling errors.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9861f720
    • R
      phy: add phy_device_remove() · 38737e49
      Russell King 提交于
      Add a phy_device_remove() function to complement phy_device_register(),
      which undoes the effects of phy_device_register() by removing the phy
      device from visibility, but not freeing it.
      
      This allows these details to be moved out of the mdio bus code into
      the phy code where this action belongs.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38737e49
    • R
      phy: fixed-phy: properly validate phy in fixed_phy_update_state() · d618bf2b
      Russell King 提交于
      Validate that the phy_device passed into fixed_phy_update_state() is a
      fixed-phy device before walking the list of phys for a fixed phy at the
      same address.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d618bf2b
    • R
      net: fix phy refcounting in a bunch of drivers · 04d53b20
      Russell King 提交于
      of_phy_find_device() increments the phy struct device refcount, which
      we need to properly balance.  Add code to network drivers using this
      function to ensure that the struct device refcount is correctly
      balanced.
      
      For xgene, looking back in the history, we should be able to use
      of_phy_connect() with a zero flags argument for the DT case as this is
      how the driver used to operate prior to de7b5b3d ("net: eth: xgene:
      change APM X-Gene SoC platform ethernet to support ACPI").
      
      This leaves the Cavium Thunder BGX unfixed; fixing this driver is a
      complicated task, one which the maintainers need to be involved with.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04d53b20
    • R
      of_mdio: fix MDIO phy device refcounting · f018ae7a
      Russell King 提交于
      bus_find_device() is defined as:
      
       * This is similar to the bus_for_each_dev() function above, but it
       * returns a reference to a device that is 'found' for later use, as
       * determined by the @match callback.
      
      and it does indeed return a reference-counted pointer to the device:
      
              while ((dev = next_device(&i)))
                      if (match(dev, data) && get_device(dev))
                                              ^^^^^^^^^^^^^^^
                              break;
              klist_iter_exit(&i);
              return dev;
      
      What that means is that when we're done with the struct device, we must
      drop that reference.  Neither of_phy_connect() nor of_phy_attach() did
      this when phy_connect_direct() or phy_attach_direct() failed.
      
      With our previous patch, phy_connect_direct() and phy_attach_direct()
      take a new refcount on the phy device when successful, so we can drop
      our local reference immediatley after these functions, whether or not
      they succeeded.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Acked-by: NRob Herring <robh@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f018ae7a
    • R
      phy: add proper phy struct device refcounting · 7322967b
      Russell King 提交于
      Take a refcount on the phy struct device when the phy device is attached
      to a network device, and drop it after it's detached.  This ensures that
      a refcount is held on the phy device while the device is being used by
      a network device, thereby preventing the phy_device from being
      unexpectedly kfree()'d by phy_device_release().
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7322967b
    • R
      phy: fix mdiobus module safety · 3e3aaf64
      Russell King 提交于
      Re-implement the mdiobus module refcounting to ensure that we actually
      ensure that the mdiobus module code does not go away while we might call
      into it.
      
      The old scheme using bus->dev.driver was buggy, because bus->dev is a
      class device which never has a struct device_driver associated with it,
      and hence the associated code trying to obtain a refcount did nothing
      useful.
      
      Instead, take the approach that other subsystems do: pass the module
      when calling mdiobus_register(), and record that in the mii_bus struct.
      When we need to increment the module use count in the phy code, use
      this stored pointer.  When the phy is deteched, drop the module
      refcount, remembering that the phy device might go away at that point.
      
      This doesn't stop the mii_bus going away while there are in-use phys -
      it merely stops the underlying code vanishing.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e3aaf64
    • R
      net: dsa: fix of_mdio_find_bus() device refcount leak · e496ae69
      Russell King 提交于
      Current users of of_mdio_find_bus() leak a struct device refcount, as
      they fail to clean up the reference obtained inside class_find_device().
      
      Fix the DSA code to properly refcount the returned MDIO bus by:
      1. taking a reference on the struct device whenever we assign it to
         pd->chip[x].host_dev.
      2. dropping the reference when we overwrite the existing reference.
      3. dropping the reference when we free the data structure.
      4. dropping the initial reference we obtained after setting up the
         platform data structure, or on failure.
      
      In step 2 above, where we obtain a new MDIO bus, there is no need to
      take a reference on it as we would only have to drop it immediately
      after assignment again, iow:
      
      	put_device(cd->host_dev);	/* drop original assignment ref */
      	cd->host_dev = get_device(&mdio_bus_switch->dev); /* get our ref */
      	put_device(&mdio_bus_switch->dev); /* drop of_mdio_find_bus ref */
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e496ae69
    • R
      phy: fix of_mdio_find_bus() device refcount leak · a1364421
      Russell King 提交于
      of_mdio_find_bus() leaks a struct device refcount, caused by using
      class_find_device() and not realising that the device reference has
      its refcount incremented:
      
       * Note, you will need to drop the reference with put_device() after use.
      ...
              while ((dev = class_dev_iter_next(&iter))) {
                      if (match(dev, data)) {
                              get_device(dev);
                              break;
                      }
      
      Update the comment, and arrange for the phy code to drop this refcount
      when disposing of a reference to it.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1364421
    • M
      ip6_tunnel: Reduce log level in ip6_tnl_err() to debug · 17a10c92
      Matt Bennett 提交于
      Currently error log messages in ip6_tnl_err are printed at 'warn'
      level. This is different to other tunnel types which don't print
      any messages. These log messages don't provide any information that
      couldn't be deduced with networking tools. Also it can be annoying
      to have one end of the tunnel go down and have the logs fill with
      pointless messages such as "Path to destination invalid or inactive!".
      
      This patch reduces the log level of these messages to 'dbg' level to
      bring the visible behaviour into line with other tunnel types.
      Signed-off-by: NMatt Bennett <matt.bennett@alliedtelesis.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      17a10c92
    • D
      Merge tag 'mac80211-for-davem-2015-09-22' of... · deccbe80
      David S. Miller 提交于
      Merge tag 'mac80211-for-davem-2015-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
      
      Johannes Berg says:
      
      ====================
      Just two small fixes:
       * VHT MCS mask array overrun, reported by Dan Carpenter
       * reset CQM history to always get a notification, from Sara Sharon
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      deccbe80
    • M
      ip6_gre: Reduce log level in ip6gre_err() to debug · a46496ce
      Matt Bennett 提交于
      Currently error log messages in ip6gre_err are printed at 'warn'
      level. This is different to most other tunnel types which don't
      print any messages. These log messages don't provide any information
      that couldn't be deduced with networking tools. Also it can be annoying
      to have one end of the tunnel go down and have the logs fill with
      pointless messages such as "Path to destination invalid or inactive!".
      
      This patch reduces the log level of these messages to 'dbg' level to
      bring the visible behaviour into line with other tunnel types.
      Signed-off-by: NMatt Bennett <matt.bennett@alliedtelesis.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a46496ce
    • W
      fib_rules: fix fib rule dumps across multiple skbs · 41fc0143
      Wilson Kok 提交于
      dump_rules returns skb length and not error.
      But when family == AF_UNSPEC, the caller of dump_rules
      assumes that it returns an error. Hence, when family == AF_UNSPEC,
      we continue trying to dump on -EMSGSIZE errors resulting in
      incorrect dump idx carried between skbs belonging to the same dump.
      This results in fib rule dump always only dumping rules that fit
      into the first skb.
      
      This patch fixes dump_rules to return error so that we exit correctly
      and idx is correctly maintained between skbs that are part of the
      same dump.
      Signed-off-by: NWilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41fc0143
    • E
      bnx2x: byte swap rss_key to comply to Toeplitz specs · d682d2bd
      Eric Dumazet 提交于
      After a good amount of debugging, I found bnx2x was byte swaping
      the 40 bytes of rss_key.
      
      If we byte swap the key, then bnx2x generates hashes matching
      MSDN specs as documented in (Verifying the RSS Hash Calculation)
      
      https://msdn.microsoft.com/en-us/library/windows/hardware/ff571021%
      28v=vs.85%29.aspx
      
      It is mostly a non issue, unless we want to mix different NIC
      in a host, and want consistent hashing among all of them, ie
      if they all use the boot time generated rss key, or if some application
      is choosing specific tuple(s) so that incoming traffic lands into known
      rx queue(s).
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d682d2bd
    • W
      net: revert "net_sched: move tp->root allocation into fw_init()" · d8aecb10
      WANG Cong 提交于
      fw filter uses tp->root==NULL to check if it is the old method,
      so it doesn't need allocation at all in this case. This patch
      reverts the offending commit and adds some comments for old
      method to make it obvious.
      
      Fixes: 33f8b9ec ("net_sched: move tp->root allocation into fw_init()")
      Reported-by: NAkshat Kakkar <akshat.1984@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d8aecb10
    • D
      Merge branch 'lwt_arp' · 8fe79c60
      David S. Miller 提交于
      Jiri Benc says:
      
      ====================
      lwtunnel: make it really work, for IPv4
      
      One of the selling points of lwtunnel was the ability to specify the tunnel
      destination using routes. However, this doesn't really work currently, as
      ARP and ndisc replies are not handled correctly. ARP and ndisc replies won't
      have tunnel metadata attached, thus they will be sent out with the default
      parameters or not sent at all, either way never reaching the requester.
      
      Most of the egress tunnel parameters can be inferred from the ingress
      metada. The only and important exception is UDP ports. This patchset infers
      the egress data from the ingress data and disallow settings of UDP ports in
      tunnel routes. If there's a need for different UDP ports, a new interface
      needs to be created for each port combination. Note that it's still possible
      to specify the UDP ports to use, it just needs to be done while creating the
      vxlan/geneve interface.
      
      This covers only ARPs. IPv6 ndisc has the same problem but is harder to
      solve, as there's already dst attached to outgoing skbs. Ideas to solve this
      are welcome.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8fe79c60
    • J
      lwtunnel: remove source and destination UDP port config option · b194f30c
      Jiri Benc 提交于
      The UDP tunnel config is asymmetric wrt. to the ports used. The source and
      destination ports from one direction of the tunnel are not related to the
      ports of the other direction. We need to be able to respond to ARP requests
      using the correct ports without involving routing.
      
      As the consequence, UDP ports need to be fixed property of the tunnel
      interface and cannot be set per route. Remove the ability to set ports per
      route. This is still okay to do, as no kernel has been released with these
      attributes yet.
      
      Note that the ability to specify source and destination ports is preserved
      for other users of the lwtunnel API which don't use routes for tunnel key
      specification (like openvswitch).
      
      If in the future we rework ARP handling to allow port specification, the
      attributes can be added back.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b194f30c
    • J
      ipv4: send arp replies to the correct tunnel · 63d008a4
      Jiri Benc 提交于
      When using ip lwtunnels, the additional data for xmit (basically, the actual
      tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
      metadata dst. When replying to ARP requests, we need to send the reply to
      the same tunnel the request came from. This means we need to construct
      proper metadata dst for ARP replies.
      
      We could perform another route lookup to get a dst entry with the correct
      lwtstate. However, this won't always ensure that the outgoing tunnel is the
      same as the incoming one, and it won't work anyway for IPv4 duplicate
      address detection.
      
      The only thing to do is to "reverse" the ip_tunnel_info.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63d008a4
    • S
      net: gianfar: remove misuse of IRQF_NO_SUSPEND flag · d5b8d640
      Sudeep Holla 提交于
      The device is set as wakeup capable using proper wakeup API but the
      driver misuses IRQF_NO_SUSPEND to set the interrupt as wakeup source
      which is incorrect.
      
      This patch removes the use of IRQF_NO_SUSPEND flags replacing it with
      enable_irq_wake instead.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Claudiu Manoil <claudiu.manoil@freescale.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NSudeep Holla <sudeep.holla@arm.com>
      Acked-by: NClaudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5b8d640
    • P
      skbuff: Fix skb checksum flag on skb pull · 6ae459bd
      Pravin B Shelar 提交于
      VXLAN device can receive skb with checksum partial. But the checksum
      offset could be in outer header which is pulled on receive. This results
      in negative checksum offset for the skb. Such skb can cause the assert
      failure in skb_checksum_help(). Following patch fixes the bug by setting
      checksum-none while pulling outer header.
      
      Following is the kernel panic msg from old kernel hitting the bug.
      
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:1906!
      RIP: 0010:[<ffffffff81518034>] skb_checksum_help+0x144/0x150
      Call Trace:
      <IRQ>
      [<ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch]
      [<ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch]
      [<ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
      [<ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
      [<ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
      [<ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch]
      [<ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
      [<ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0
      [<ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0
      [<ffffffff8157ba7a>] udp_rcv+0x1a/0x20
      [<ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280
      [<ffffffff81550128>] ip_local_deliver+0x88/0x90
      [<ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370
      [<ffffffff81550365>] ip_rcv+0x235/0x300
      [<ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620
      [<ffffffff8151c360>] netif_receive_skb+0x80/0x90
      [<ffffffff81459935>] virtnet_poll+0x555/0x6f0
      [<ffffffff8151cd04>] net_rx_action+0x134/0x290
      [<ffffffff810683d8>] __do_softirq+0xa8/0x210
      [<ffffffff8162fe6c>] call_softirq+0x1c/0x30
      [<ffffffff810161a5>] do_softirq+0x65/0xa0
      [<ffffffff810687be>] irq_exit+0x8e/0xb0
      [<ffffffff81630733>] do_IRQ+0x63/0xe0
      [<ffffffff81625f2e>] common_interrupt+0x6e/0x6e
      Reported-by: NAnupam Chanda <achanda@vmware.com>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Acked-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ae459bd
    • H
      netlink: Replace rhash_portid with bound · da314c99
      Herbert Xu 提交于
      On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
      >
      > store_release and load_acquire are different from the usual memory
      > barriers and can't be paired this way.  You have to pair store_release
      > and load_acquire.  Besides, it isn't a particularly good idea to
      
      OK I've decided to drop the acquire/release helpers as they don't
      help us at all and simply pessimises the code by using full memory
      barriers (on some architectures) where only a write or read barrier
      is needed.
      
      > depend on memory barriers embedded in other data structures like the
      > above.  Here, especially, rhashtable_insert() would have write barrier
      > *before* the entry is hashed not necessarily *after*, which means that
      > in the above case, a socket which appears to have set bound to a
      > reader might not visible when the reader tries to look up the socket
      > on the hashtable.
      
      But you are right we do need an explicit write barrier here to
      ensure that the hashing is visible.
      
      > There's no reason to be overly smart here.  This isn't a crazy hot
      > path, write barriers tend to be very cheap, store_release more so.
      > Please just do smp_store_release() and note what it's paired with.
      
      It's not about being overly smart.  It's about actually understanding
      what's going on with the code.  I've seen too many instances of
      people simply sprinkling synchronisation primitives around without
      any knowledge of what is happening underneath, which is just a recipe
      for creating hard-to-debug races.
      
      > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
      > >  		}
      > >  	}
      > >
      > > -	if (!nlk->portid) {
      > > +	if (!nlk->bound) {
      >
      > I don't think you can skip load_acquire here just because this is the
      > second deref of the variable.  That doesn't change anything.  Race
      > condition could still happen between the first and second tests and
      > skipping the second would lead to the same kind of bug.
      
      The reason this one is OK is because we do not use nlk->portid or
      try to get nlk from the hash table before we return to user-space.
      
      However, there is a real bug here that none of these acquire/release
      helpers discovered.  The two bound tests here used to be a single
      one.  Now that they are separate it is entirely possible for another
      thread to come in the middle and bind the socket.  So we need to
      repeat the portid check in order to maintain consistency.
      
      > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
      > >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
      > >  		return -EPERM;
      > >
      > > -	if (!nlk->portid)
      > > +	if (!nlk->bound)
      >
      > Don't we need load_acquire here too?  Is this path holding a lock
      > which makes that unnecessary?
      
      Ditto.
      
      ---8<---
      The commit 1f770c0a ("netlink:
      Fix autobind race condition that leads to zero port ID") created
      some new races that can occur due to inconcsistencies between the
      two port IDs.
      
      Tejun is right that a barrier is unavoidable.  Therefore I am
      reverting to the original patch that used a boolean to indicate
      that a user netlink socket has been bound.
      
      Barriers have been added where necessary to ensure that a valid
      portid and the hashed socket is visible.
      
      I have also changed netlink_insert to only return EBUSY if the
      socket is bound to a portid different to the requested one.  This
      combined with only reading nlk->bound once in netlink_bind fixes
      a race where two threads that bind the socket at the same time
      with different port IDs may both succeed.
      
      Fixes: 1f770c0a ("netlink: Fix autobind race condition that leads to zero port ID")
      Reported-by: NTejun Heo <tj@kernel.org>
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Nacked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da314c99
  2. 24 9月, 2015 18 次提交
    • J
      geneve: use network byte order for destination port config parameter · 7bbe33ff
      John W. Linville 提交于
      This is primarily for consistancy with vxlan and other tunnels which
      use network byte order for similar parameters.
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bbe33ff
    • D
      8139cp: Dump contents of descriptor ring on TX timeout · 41b97641
      David Woodhouse 提交于
      We are seeing unexplained TX timeouts under heavy load. Let's try to get
      a better idea of what's going on.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41b97641
    • D
      8139cp: Fix DMA unmapping of transmitted buffers · 7f4c6856
      David Woodhouse 提交于
      The low 16 bits of the 'opts1' field in the TX descriptor are supposed
      to still contain the buffer length when the descriptor is handed back to
      us. In practice, at least on my hardware, they don't. So stash the
      original value of the opts1 field and get the length to unmap from
      there.
      
      There are other ways we could have worked out the length, but I actually
      want a stash of the opts1 field anyway so that I can dump it alongside
      the contents of the descriptor ring when we suffer a TX timeout.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f4c6856
    • D
      8139cp: Reduce duplicate csum/tso code in cp_start_xmit() · 0a5aeee0
      David Woodhouse 提交于
      We calculate the value of the opts1 descriptor field in three different
      places. With two different behaviours when given an invalid packet to
      be checksummed — none of them correct. Sort that out.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a5aeee0
    • D
      8139cp: Fix TSO/scatter-gather descriptor setup · a3b80404
      David Woodhouse 提交于
      When sending a TSO frame in multiple buffers, we were neglecting to set
      the first descriptor up in TSO mode.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3b80404
    • D
      8139cp: Fix tx_queued debug message to print correct slot numbers · 26b0bad6
      David Woodhouse 提交于
      After a certain amount of staring at the debug output of this driver, I
      realised it was lying to me.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26b0bad6
    • D
      8139cp: Do not re-enable RX interrupts in cp_tx_timeout() · aaa0062e
      David Woodhouse 提交于
      If an RX interrupt was already received but NAPI has not yet run when
      the RX timeout happens, we end up in cp_tx_timeout() with RX interrupts
      already disabled. Blindly re-enabling them will cause an IRQ storm.
      
      (This is made particularly horrid by the fact that cp_interrupt() always
      returns that it's handled the interrupt, even when it hasn't actually
      done anything. If it didn't do that, the core IRQ code would have
      detected the storm and handled it, I'd have had a clear smoking gun
      backtrace instead of just a spontaneously resetting router, and I'd have
      at *least* two days of my life back. Changing the return value of
      cp_interrupt() will be argued about under separate cover.)
      
      Unconditionally leave RX interrupts disabled after the reset, and
      schedule NAPI to check the receive ring and re-enable them.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aaa0062e
    • D
      Merge branch 'netcp-fixes' · 3c6cb3ac
      David S. Miller 提交于
      Murali Karicheri says:
      
      ====================
      net: netcp: a set of bug fixes
      
      This patch series fixes a set of issues in netcp driver seen during internal
      testing of the driver. While at it, do some clean up as well.
      
      The fixes are tested on K2HK, K2L and K2E EVMs and the boot up logs can be
      seen at
      
       http://pastebin.ubuntu.com/12533100/
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c6cb3ac
    • K
      net: netcp: fix deadlock reported by lockup detector · 8ceaf361
      Karicheri, Muralidharan 提交于
      A deadlock trace is seen in netcp driver with lockup detector enabled.
      The trace log is provided below for reference. This patch fixes the
      bug by removing the usage of netcp_modules_lock within ndo_ops functions.
      ndo_{open/close/ioctl)() is already called with rtnl_lock held. So there
      is no need to hold another mutex for serialization across processes on
      multiple cores.  So remove use of netcp_modules_lock mutex from these
      ndo ops functions.
      
      ndo_set_rx_mode() shouldn't be using a mutex as it is called from atomic
      context. In the case of ndo_set_rx_mode(), there can be call to this API
      without rtnl_lock held from an atomic context. As the underlying modules
      are expected to add address to a hardware table, it is to be protected
      across concurrent updates and hence a spin lock is used to synchronize
      the access. Same with ndo_vlan_rx_add_vid() & ndo_vlan_rx_kill_vid().
      
      Probably the netcp_modules_lock is used to protect the module not being
      removed as part of rmmod. Currently this is not fully implemented and
      assumes the interface is brought down before doing rmmod of modules.
      The support for rmmmod while interface is up is expected in a future
      patch set when additional modules such as pa, qos are added. For now
      all of the tests such as if up/down, reboot, iperf works fine with this
      patch applied.
      
      Deadlock trace seen with lockup detector enabled is shown below for
      reference.
      
      [   16.863014] ======================================================
      [   16.869183] [ INFO: possible circular locking dependency detected ]
      [   16.875441] 4.1.6-01265-gfb1e101 #1 Tainted: G        W
      [   16.881176] -------------------------------------------------------
      [   16.887432] ifconfig/1662 is trying to acquire lock:
      [   16.892386]  (netcp_modules_lock){+.+.+.}, at: [<c03e8110>]
      netcp_ndo_open+0x168/0x518
      [   16.900321]
      [   16.900321] but task is already holding lock:
      [   16.906144]  (rtnl_mutex){+.+.+.}, at: [<c053a418>] devinet_ioctl+0xf8/0x7e4
      [   16.913206]
      [   16.913206] which lock already depends on the new lock.
      [   16.913206]
      [   16.921372]
      [   16.921372] the existing dependency chain (in reverse order) is:
      [   16.928844]
      -> #1 (rtnl_mutex){+.+.+.}:
      [   16.932865]        [<c06023f0>] mutex_lock_nested+0x68/0x4a8
      [   16.938521]        [<c04c5758>] register_netdev+0xc/0x24
      [   16.943831]        [<c03e65c0>] netcp_module_probe+0x214/0x2ec
      [   16.949660]        [<c03e8a54>] netcp_register_module+0xd4/0x140
      [   16.955663]        [<c089654c>] keystone_gbe_init+0x10/0x28
      [   16.961233]        [<c000977c>] do_one_initcall+0xb8/0x1f8
      [   16.966714]        [<c0867e04>] kernel_init_freeable+0x148/0x1e8
      [   16.972720]        [<c05f9994>] kernel_init+0xc/0xe8
      [   16.977682]        [<c0010038>] ret_from_fork+0x14/0x3c
      [   16.982905]
      -> #0 (netcp_modules_lock){+.+.+.}:
      [   16.987619]        [<c006eab0>] lock_acquire+0x118/0x320
      [   16.992928]        [<c06023f0>] mutex_lock_nested+0x68/0x4a8
      [   16.998582]        [<c03e8110>] netcp_ndo_open+0x168/0x518
      [   17.004064]        [<c04c48f0>] __dev_open+0xa8/0x10c
      [   17.009112]        [<c04c4b74>] __dev_change_flags+0x94/0x144
      [   17.014853]        [<c04c4c3c>] dev_change_flags+0x18/0x48
      [   17.020334]        [<c053a9fc>] devinet_ioctl+0x6dc/0x7e4
      [   17.025729]        [<c04a59ec>] sock_ioctl+0x1d0/0x2a8
      [   17.030865]        [<c0142844>] do_vfs_ioctl+0x41c/0x688
      [   17.036173]        [<c0142ae4>] SyS_ioctl+0x34/0x5c
      [   17.041046]        [<c000ff60>] ret_fast_syscall+0x0/0x54
      [   17.046441]
      [   17.046441] other info that might help us debug this:
      [   17.046441]
      [   17.054434]  Possible unsafe locking scenario:
      [   17.054434]
      [   17.060343]        CPU0                    CPU1
      [   17.064862]        ----                    ----
      [   17.069381]   lock(rtnl_mutex);
      [   17.072522]                                lock(netcp_modules_lock);
      [   17.078875]                                lock(rtnl_mutex);
      [   17.084532]   lock(netcp_modules_lock);
      [   17.088366]
      [   17.088366]  *** DEADLOCK ***
      [   17.088366]
      [   17.094279] 1 lock held by ifconfig/1662:
      [   17.098278]  #0:  (rtnl_mutex){+.+.+.}, at: [<c053a418>]
      devinet_ioctl+0xf8/0x7e4
      [   17.105774]
      [   17.105774] stack backtrace:
      [   17.110124] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   17.118637] Hardware name: Keystone
      [   17.122123] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   17.129862] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   17.137079] [<c05ff450>] (dump_stack) from [<c0068e34>]
      (print_circular_bug+0x210/0x330)
      [   17.145161] [<c0068e34>] (print_circular_bug) from [<c006ab7c>]
      (validate_chain.isra.35+0xf98/0x13ac)
      [   17.154372] [<c006ab7c>] (validate_chain.isra.35) from [<c006da60>]
      (__lock_acquire+0x52c/0xcc0)
      [   17.163149] [<c006da60>] (__lock_acquire) from [<c006eab0>]
      (lock_acquire+0x118/0x320)
      [   17.171058] [<c006eab0>] (lock_acquire) from [<c06023f0>]
      (mutex_lock_nested+0x68/0x4a8)
      [   17.179140] [<c06023f0>] (mutex_lock_nested) from [<c03e8110>]
      (netcp_ndo_open+0x168/0x518)
      [   17.187484] [<c03e8110>] (netcp_ndo_open) from [<c04c48f0>]
      (__dev_open+0xa8/0x10c)
      [   17.195133] [<c04c48f0>] (__dev_open) from [<c04c4b74>]
      (__dev_change_flags+0x94/0x144)
      [   17.203129] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>]
      (dev_change_flags+0x18/0x48)
      [   17.211560] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>]
      (devinet_ioctl+0x6dc/0x7e4)
      [   17.219729] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>]
      (sock_ioctl+0x1d0/0x2a8)
      [   17.227378] [<c04a59ec>] (sock_ioctl) from [<c0142844>]
      (do_vfs_ioctl+0x41c/0x688)
      [   17.234939] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>]
      (SyS_ioctl+0x34/0x5c)
      [   17.242242] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>]
      (ret_fast_syscall+0x0/0x54)
      [   17.258855] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
      control off
      [   17.271282] BUG: sleeping function called from invalid context at
      kernel/locking/mutex.c:616
      [   17.279712] in_atomic(): 1, irqs_disabled(): 0, pid: 1662, name: ifconfig
      [   17.286500] INFO: lockdep is turned off.
      [   17.290413] Preemption disabled at:[<  (null)>]   (null)
      [   17.295728]
      [   17.297214] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   17.305735] Hardware name: Keystone
      [   17.309223] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   17.316970] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   17.324194] [<c05ff450>] (dump_stack) from [<c06023b0>]
      (mutex_lock_nested+0x28/0x4a8)
      [   17.332112] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>]
      (netcp_set_rx_mode+0x160/0x210)
      [   17.340724] [<c03e9840>] (netcp_set_rx_mode) from [<c04c483c>]
      (dev_set_rx_mode+0x1c/0x28)
      [   17.348982] [<c04c483c>] (dev_set_rx_mode) from [<c04c490c>]
      (__dev_open+0xc4/0x10c)
      [   17.356724] [<c04c490c>] (__dev_open) from [<c04c4b74>]
      (__dev_change_flags+0x94/0x144)
      [   17.364729] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>]
      (dev_change_flags+0x18/0x48)
      [   17.373166] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>]
      (devinet_ioctl+0x6dc/0x7e4)
      [   17.381344] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>]
      (sock_ioctl+0x1d0/0x2a8)
      [   17.388994] [<c04a59ec>] (sock_ioctl) from [<c0142844>]
      (do_vfs_ioctl+0x41c/0x688)
      [   17.396563] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>]
      (SyS_ioctl+0x34/0x5c)
      [   17.403873] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>]
      (ret_fast_syscall+0x0/0x54)
      [   17.413772] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
      udhcpc (v1.20.2) started
      Sending discover...
      [   18.690666] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
      control off
      Sending discover...
      [   22.250972] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow
      control off
      [   22.258721] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
      [   22.265458] BUG: sleeping function called from invalid context at
      kernel/locking/mutex.c:616
      [   22.273896] in_atomic(): 1, irqs_disabled(): 0, pid: 342, name: kworker/1:1
      [   22.280854] INFO: lockdep is turned off.
      [   22.284767] Preemption disabled at:[<  (null)>]   (null)
      [   22.290074]
      [   22.291568] CPU: 1 PID: 342 Comm: kworker/1:1 Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   22.300255] Hardware name: Keystone
      [   22.303750] Workqueue: ipv6_addrconf addrconf_dad_work
      [   22.308895] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   22.316643] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   22.323867] [<c05ff450>] (dump_stack) from [<c06023b0>]
      (mutex_lock_nested+0x28/0x4a8)
      [   22.331786] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>]
      (netcp_set_rx_mode+0x160/0x210)
      [   22.340394] [<c03e9840>] (netcp_set_rx_mode) from [<c04c9d18>]
      (__dev_mc_add+0x54/0x68)
      [   22.348401] [<c04c9d18>] (__dev_mc_add) from [<c05ab358>]
      (igmp6_group_added+0x168/0x1b4)
      [   22.356580] [<c05ab358>] (igmp6_group_added) from [<c05ad2cc>]
      (ipv6_dev_mc_inc+0x4f0/0x5a8)
      [   22.365019] [<c05ad2cc>] (ipv6_dev_mc_inc) from [<c058f0d0>]
      (addrconf_dad_work+0x21c/0x33c)
      [   22.373460] [<c058f0d0>] (addrconf_dad_work) from [<c0042850>]
      (process_one_work+0x214/0x8d0)
      [   22.381986] [<c0042850>] (process_one_work) from [<c0042f54>]
      (worker_thread+0x48/0x4bc)
      [   22.390071] [<c0042f54>] (worker_thread) from [<c004868c>]
      (kthread+0xf0/0x108)
      [   22.397381] [<c004868c>] (kthread) from [<c0010038>]
      
      Trace related to incorrect usage of mutex inside ndo_set_rx_mode
      
      [   24.086066] BUG: sleeping function called from invalid context at
      kernel/locking/mutex.c:616
      [   24.094506] in_atomic(): 1, irqs_disabled(): 0, pid: 1682, name: ifconfig
      [   24.101291] INFO: lockdep is turned off.
      [   24.105203] Preemption disabled at:[<  (null)>]   (null)
      [   24.110511]
      [   24.112005] CPU: 2 PID: 1682 Comm: ifconfig Tainted: G        W
      4.1.6-01265-gfb1e101 #1
      [   24.120518] Hardware name: Keystone
      [   24.124018] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>]
      (show_stack+0x10/0x14)
      [   24.131772] [<c0013cbc>] (show_stack) from [<c05ff450>]
      (dump_stack+0x84/0xc4)
      [   24.138989] [<c05ff450>] (dump_stack) from [<c06023b0>]
      (mutex_lock_nested+0x28/0x4a8)
      [   24.146908] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>]
      (netcp_set_rx_mode+0x160/0x210)
      [   24.155523] [<c03e9840>] (netcp_set_rx_mode) from [<c04c483c>]
      (dev_set_rx_mode+0x1c/0x28)
      [   24.163787] [<c04c483c>] (dev_set_rx_mode) from [<c04c490c>]
      (__dev_open+0xc4/0x10c)
      [   24.171531] [<c04c490c>] (__dev_open) from [<c04c4b74>]
      (__dev_change_flags+0x94/0x144)
      [   24.179528] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>]
      (dev_change_flags+0x18/0x48)
      [   24.187966] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>]
      (devinet_ioctl+0x6dc/0x7e4)
      [   24.196145] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>]
      (sock_ioctl+0x1d0/0x2a8)
      [   24.203803] [<c04a59ec>] (sock_ioctl) from [<c0142844>]
      (do_vfs_ioctl+0x41c/0x688)
      [   24.211373] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>]
      (SyS_ioctl+0x34/0x5c)
      [   24.218676] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>]
      (ret_fast_syscall+0x0/0x54)
      [   24.227156] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ceaf361
    • K
      net: netcp: allocate buffers to desc before re-enable interrupt · 99f8ef5d
      Karicheri, Muralidharan 提交于
      Currently netcp_rxpool_refill() that refill descriptors and attached
      buffers to fdq while interrupt is enabled as part of NAPI poll. Doing
      it while interrupt is disabled could be beneficial as hardware will
      not be starved when CPU is busy with processing interrupt.
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      99f8ef5d
    • K
      net: netcp: check for interface handle in netcp_module_probe() · 915c5857
      Karicheri, Muralidharan 提交于
      Currently netcp_module_probe() doesn't check the return value of
      of_parse_phandle() that points to the interface data for the
      module and then pass the node ptr to the module which is incorrect.
      Check for return value and free the intf_modpriv if there is error.
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      915c5857
    • K
      net: netcp: add error check to netcp_allocate_rx_buf() · e558b1fb
      Karicheri, Muralidharan 提交于
      Currently, if netcp_allocate_rx_buf() fails due no descriptors
      in the rx free descriptor queue, inside the netcp_rxpool_refill() function
      the iterative loop to fill buffers doesn't terminate right away. So modify
      the netcp_allocate_rx_buf() to return an error code and use it break the
      loop when there is error.
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e558b1fb
    • K
      net: netcp: move netcp_register_interface() to after attach module · 736532a0
      Karicheri, Muralidharan 提交于
      The netcp interface is not fully initialized before attach the module
      to the interface. For example, the tx pipe/rx pipe is initialized
      in ethss module as part of attach(). So until this is complete, the
      interface can't be registered.  So move registration of interface to
      net device outside the current loop that attaches the modules to the
      interface.
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      736532a0
    • K
      net: netcp: remove dead code from the driver · 156e3c21
      Karicheri, Muralidharan 提交于
      netcp_core is the first driver that will get initialized and the modules
      (ethss, pa etc) will then get initialized. So the code at the end of
      netcp_probe() that iterate over the modules is a dead code as the module
      list will be always be empty. So remove this code.
      Signed-off-by: NMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      156e3c21
    • W
      net: netcp: ethss: fix error in calling sgmii api with incorrect offset · 8c85151d
      WingMan Kwok 提交于
      On K2HK, sgmii module registers of slave 0 and 1 are mem
      mapped to one contiguous block, while those of slave 2
      and 3 are mapped to another contiguous block.  However,
      on K2E and K2L, sgmii module registers of all slaves are
      mem mapped to one contiguous block.  SGMII APIs expect
      slave 0 sgmii base when API is invoked for slave 0 and 1,
      and slave 2 sgmii base when invoked for other slaves.
      Before this patch, slave 0 sgmii base is always passed to
      sgmii API for K2E regardless which slave is the API invoked
      for.  This patch fixes the problem.
      Signed-off-by: NWingMan Kwok <w-kwok2@ti.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c85151d
    • D
      Fix AF_PACKET ABI breakage in 4.2 · d3869efe
      David Woodhouse 提交于
      Commit 7d824109 ("virtio: add explicit big-endian support to memory
      accessors") accidentally changed the virtio_net header used by
      AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.
      
      Since virtio_legacy_is_little_endian() is a very long identifier,
      define a vio_le macro and use that throughout the code instead of the
      hard-coded 'false' for little-endian.
      
      This restores the ABI to match 4.1 and earlier kernels, and makes my
      test program work again.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3869efe
    • N
      netpoll: Close race condition between poll_one_napi and napi_disable · 2d8bff12
      Neil Horman 提交于
      Drivers might call napi_disable while not holding the napi instance poll_lock.
      In those instances, its possible for a race condition to exist between
      poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
      NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
      the following may happen:
      
      CPU0				CPU1
      ndo_tx_timeout			napi_poll_dev
       napi_disable			 poll_one_napi
        test_and_set_bit (ret 0)
      				  test_bit (ret 1)
         reset adapter		   napi_poll_routine
      
      If the adapter gets a tx timeout without a napi instance scheduled, its possible
      for the adapter to think it has exclusive access to the hardware  (as the napi
      instance is now scheduled via the napi_disable call), while the netpoll code
      thinks there is simply work to do.  The result is parallel hardware access
      leading to corrupt data structures in the driver, and a crash.
      
      Additionaly, there is another, more critical race between netpoll and
      napi_disable.  The disabled napi state is actually identical to the scheduled
      state for a given napi instance.  The implication being that, if a napi instance
      is disabled, a netconsole instance would see the napi state of the device as
      having been scheduled, and poll it, likely while the driver was dong something
      requiring exclusive access.  In the case above, its fairly clear that not having
      the rings in a state ready to be polled will cause any number of crashes.
      
      The fix should be pretty easy.  netpoll uses its own bit to indicate that that
      the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
      We can just gate disabling on that bit as well as the sched bit.  That should
      prevent netpoll from conducting a napi poll if we convert its set bit to a
      test_and_set_bit operation to provide mutual exclusion
      
      Change notes:
      V2)
      	Remove a trailing whtiespace
      	Resubmit with proper subject prefix
      
      V3)
      	Clean up spacing nits
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jmaxwell@redhat.com
      Tested-by: jmaxwell@redhat.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d8bff12
    • E
      tcp: add proper TS val into RST packets · 675ee231
      Eric Dumazet 提交于
      RST packets sent on behalf of TCP connections with TS option (RFC 7323
      TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.
      
      A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
      ecr 0], length 0
      B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
      1460,nop,nop,TS val 7264344 ecr 100], length 0
      A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
      7264344], length 0
      
      B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
      ecr 110], length 0
      
      We need to call skb_mstamp_get() to get proper TS val,
      derived from skb->skb_mstamp
      
      Note that RFC 1323 was advocating to not send TS option in RST segment,
      but RFC 7323 recommends the opposite :
      
        Once TSopt has been successfully negotiated, that is both <SYN> and
        <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
        segment for the duration of the connection, and SHOULD be sent in an
        <RST> segment (see Section 5.2 for details)
      
      Note this RFC recommends to send TS val = 0, but we believe it is
      premature : We do not know if all TCP stacks are properly
      handling the receive side :
      
         When an <RST> segment is
         received, it MUST NOT be subjected to the PAWS check by verifying an
         acceptable value in SEG.TSval, and information from the Timestamps
         option MUST NOT be used to update connection state information.
         SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.
      
      In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
      to decide to send TS val = 0, if it buys something.
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      675ee231