  1. 02 October 2015 (2 commits)
    • memcg: remove pcp_counter_lock · ef510194
      Authored by Greg Thelen
      Commit 733a572e ("memcg: make mem_cgroup_read_{stat|event}() iterate
      possible cpus instead of online") removed the last use of the per memcg
      pcp_counter_lock but forgot to remove the variable.
      
      Kill the vestigial variable.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix dirty page migration · 0610c25d
      Authored by Greg Thelen
      The problem starts with a file-backed dirty page which is charged to a
      memcg.  Then page migration is used to move oldpage to newpage.
      
      Migration:
       - copies the oldpage's data to newpage
       - clears oldpage.PG_dirty
       - sets newpage.PG_dirty
       - uncharges oldpage from memcg
       - charges newpage to memcg
      
      Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
      count.
      
      However, because newpage is not yet charged, setting newpage.PG_dirty
      does not increment the memcg's dirty page count.  After migration
      completes newpage.PG_dirty is eventually cleared, often in
      account_page_cleaned().  At this time newpage is charged to a memcg so
      the memcg's dirty page count is decremented which causes underflow
      because the count was not previously incremented by migration.  This
      underflow causes balance_dirty_pages() to see a very large unsigned
      number of dirty memcg pages which leads to aggressive throttling of
      buffered writes by processes in non-root memcgs.
      
      This issue:
       - can harm performance of non-root memcg buffered writes.
       - can report too small (even negative) values in
         memory.stat[(total_)dirty] counters of all memcgs, including the root.
      
      To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
      page_memcg() and set_page_memcg() helpers.
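
      A minimal sketch of such helpers (assuming, as in kernels of this era,
      that page->mem_cgroup only exists under CONFIG_MEMCG):

          #ifdef CONFIG_MEMCG
          static inline struct mem_cgroup *page_memcg(struct page *page)
          {
                  return page->mem_cgroup;
          }

          static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
          {
                  page->mem_cgroup = memcg;
          }
          #else
          static inline struct mem_cgroup *page_memcg(struct page *page)
          {
                  return NULL;
          }

          static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
          {
          }
          #endif

      Migration can then hand the memcg binding to newpage before the dirty
      flag moves, e.g. set_page_memcg(newpage, page_memcg(oldpage)), without
      any #ifdef in migrate.c.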
      
      Test:
          0) setup and enter limited memcg
          mkdir /sys/fs/cgroup/test
          echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/test/cgroup.procs
      
          1) buffered writes baseline
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          2) buffered writes with compaction antagonist to induce migration
          yes 1 > /proc/sys/vm/compact_memory &
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          kill %
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          3) buffered writes without antagonist, should match baseline
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
                             (speed, dirty residue)
                   unpatched                       patched
          1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
          2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
          3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
      
          Notice that unpatched baseline performance (1) fell after
          migration (3): 841 -> 114 MB/s.  In the patched kernel, post
          migration performance matches baseline.
      
      Fixes: c4843a75 ("memcg: add per cgroup dirty page accounting")
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 September 2015 (10 commits)
  3. 29 September 2015 (1 commit)
  4. 26 September 2015 (1 commit)
  5. 25 September 2015 (5 commits)
    • IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY · c6790aa9
      Authored by Sagi Grimberg
      Commit 96249d70 ("IB/core: Guarantee that a local_dma_lkey
      is available") allows ULPs that make use of the local DMA lkey to keep
      working as before by allocating a DMA MR with local permissions, and
      converted these consumers to use the MR associated with the PD
      rather than device->local_dma_lkey.
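
      For illustration, a hedged sketch of the consumer-side conversion
      (fill_local_sge() is a made-up helper; pd->local_dma_lkey is the field
      guaranteed by the commit above):

          /* Build an SGE for local DMA using the PD's guaranteed lkey instead
           * of the device-wide device->local_dma_lkey, which ConnectIB cannot
           * provide reliably. */
          static void fill_local_sge(struct ib_sge *sge, struct ib_pd *pd,
                                     u64 dma_addr, u32 len)
          {
                  sge->addr   = dma_addr;
                  sge->length = len;
                  sge->lkey   = pd->local_dma_lkey;
          }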
      
      ConnectIB has some known issues with memory registration
      using the local_dma_lkey (SEND, RDMA and RECV seem to work OK).
      
      Thus don't expose support for it (remove device->local_dma_lkey
      setting), and take advantage of the above commit such that no regression
      is introduced to working systems.
      
      The local_dma_lkey support will be restored in CX4 depending on FW
      capability query.
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • phy: add phy_device_remove() · 38737e49
      Authored by Russell King
      Add a phy_device_remove() function to complement phy_device_register(),
      which undoes the effects of phy_device_register() by removing the phy
      device from visibility, but not freeing it.
      
      This allows these details to be moved out of the mdio bus code into
      the phy code where this action belongs.
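
      A sketch of the intended pairing (the example wrapper is illustrative;
      it assumes the existing phy_device_free() still drops the final
      reference):

          static int example_register_and_remove(struct phy_device *phydev)
          {
                  int err;

                  /* Registration makes the phy device visible on the bus. */
                  err = phy_device_register(phydev);
                  if (err)
                          return err;

                  /* Teardown now mirrors it: first undo the registration
                   * (remove from visibility, but do not free), then drop the
                   * last reference, which frees the device. */
                  phy_device_remove(phydev);
                  phy_device_free(phydev);
                  return 0;
          }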
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • phy: fix mdiobus module safety · 3e3aaf64
      Authored by Russell King
      Re-implement the mdiobus module refcounting so that we actually
      ensure that the mdiobus module code does not go away while we might call
      into it.
      
      The old scheme using bus->dev.driver was buggy, because bus->dev is a
      class device which never has a struct device_driver associated with it,
      and hence the associated code trying to obtain a refcount did nothing
      useful.
      
      Instead, take the approach that other subsystems do: pass the module
      when calling mdiobus_register(), and record that in the mii_bus struct.
      When we need to increment the module use count in the phy code, use
      this stored pointer.  When the phy is detached, drop the module
      refcount, remembering that the phy device might go away at that point.
      
      This doesn't stop the mii_bus going away while there are in-use phys -
      it merely stops the underlying code vanishing.
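
      A sketch of the scheme (the owner field and the __mdiobus_register()
      split mirror how other subsystems do this; treat the exact names as
      assumptions):

          /* Record the module that provides the bus code at registration time. */
          #define mdiobus_register(bus)   __mdiobus_register(bus, THIS_MODULE)

          int __mdiobus_register(struct mii_bus *bus, struct module *owner)
          {
                  bus->owner = owner;     /* assumed new struct mii_bus field */
                  /* ... existing registration steps ... */
                  return 0;
          }

          /* Pin that module while a phy is attached; drop it on detach, after
           * which the phy device itself may disappear. */
          static int phy_pin_bus_module(struct phy_device *phydev)
          {
                  return try_module_get(phydev->bus->owner) ? 0 : -EIO;
          }

          static void phy_unpin_bus_module(struct phy_device *phydev)
          {
                  module_put(phydev->bus->owner);
          }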
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: Fix skb checksum flag on skb pull · 6ae459bd
      Authored by Pravin B Shelar
      A VXLAN device can receive an skb with CHECKSUM_PARTIAL set, but the
      checksum start offset may lie in the outer header, which is pulled on
      receive. This leaves the skb with a negative checksum offset, and such
      an skb can trigger the assertion failure in skb_checksum_help(). This
      patch fixes the bug by setting the checksum to CHECKSUM_NONE while
      pulling the outer header.
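
      Conceptually, the fix amounts to a check like the following after the
      outer header has been pulled (a sketch with a made-up helper name, not
      necessarily the exact hunk):

          /* A CHECKSUM_PARTIAL skb whose checksum start now lies before the
           * data pointer can no longer be offloaded; fall back to
           * CHECKSUM_NONE so skb_checksum_help() never sees a negative
           * offset. */
          static inline void skb_drop_stale_partial_csum(struct sk_buff *skb)
          {
                  if (skb->ip_summed == CHECKSUM_PARTIAL &&
                      skb_checksum_start_offset(skb) < 0)
                          skb->ip_summed = CHECKSUM_NONE;
          }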
      
      Following is the kernel panic message from an old kernel hitting the bug.
      
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:1906!
      RIP: 0010:[<ffffffff81518034>] skb_checksum_help+0x144/0x150
      Call Trace:
      <IRQ>
      [<ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch]
      [<ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch]
      [<ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
      [<ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
      [<ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
      [<ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch]
      [<ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
      [<ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0
      [<ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0
      [<ffffffff8157ba7a>] udp_rcv+0x1a/0x20
      [<ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280
      [<ffffffff81550128>] ip_local_deliver+0x88/0x90
      [<ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370
      [<ffffffff81550365>] ip_rcv+0x235/0x300
      [<ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620
      [<ffffffff8151c360>] netif_receive_skb+0x80/0x90
      [<ffffffff81459935>] virtnet_poll+0x555/0x6f0
      [<ffffffff8151cd04>] net_rx_action+0x134/0x290
      [<ffffffff810683d8>] __do_softirq+0xa8/0x210
      [<ffffffff8162fe6c>] call_softirq+0x1c/0x30
      [<ffffffff810161a5>] do_softirq+0x65/0xa0
      [<ffffffff810687be>] irq_exit+0x8e/0xb0
      [<ffffffff81630733>] do_IRQ+0x63/0xe0
      [<ffffffff81625f2e>] common_interrupt+0x6e/0x6e
      Reported-by: Anupam Chanda <achanda@vmware.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cgroup, writeback: don't enable cgroup writeback on traditional hierarchies · 9badce00
      Authored by Tejun Heo
      inode_cgwb_enabled() gates cgroup writeback support.  If it returns
      true, each inode is attached to the corresponding memory domain, which
      gets mapped to an io domain.  It currently only tests whether the
      filesystem and bdi support cgroup writeback; however, cgroup writeback
      support doesn't work on traditional hierarchies, and thus it should
      also test whether memcg and iocg are on the default hierarchy.
      
      This caused traditional hierarchy setups to hit the cgroup writeback
      path inadvertently and ended up creating separate writeback domains
      for each memcg and mapping them all to the root iocg, uncovering a
      couple of issues in the cgroup writeback path.
      
      cgroup writeback was never meant to be enabled on traditional
      hierarchies.  Make inode_cgwb_enabled() test whether both memcg and
      iocg are on the default hierarchy.
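
      Conceptually the gate becomes something like the following (the
      *_on_default_hierarchy() helpers stand in for the real root-css checks
      and are illustrative names only; the bdi/superblock tests already
      existed):

          static inline bool inode_cgwb_enabled(struct inode *inode)
          {
                  struct backing_dev_info *bdi = inode_to_bdi(inode);

                  return memcg_on_default_hierarchy() &&  /* illustrative */
                         iocg_on_default_hierarchy() &&   /* illustrative */
                         bdi_cap_account_dirty(bdi) &&
                         (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
                         (inode->i_sb->s_iflags & SB_I_CGROUPWB);
          }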
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
      Reported-by: Dexuan Cui <decui@microsoft.com>
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Link: http://lkml.kernel.org/g/f30d4a6aa8a546ff88f73021d026a453@SIXPR30MB031.064d.mgd.msft.net
  6. 24 September 2015 (2 commits)
    • netpoll: Close race condition between poll_one_napi and napi_disable · 2d8bff12
      Authored by Neil Horman
      Drivers might call napi_disable while not holding the napi instance poll_lock.
      In those instances, it's possible for a race condition to exist between
      poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
      NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
      the following may happen:
      
      CPU0				CPU1
      ndo_tx_timeout			napi_poll_dev
       napi_disable			 poll_one_napi
        test_and_set_bit (ret 0)
      				  test_bit (ret 1)
         reset adapter		   napi_poll_routine
      
      If the adapter gets a tx timeout without a napi instance scheduled, it's possible
      for the adapter to think it has exclusive access to the hardware (as the napi
      instance is now scheduled via the napi_disable call), while the netpoll code
      thinks there is simply work to do.  The result is parallel hardware access
      leading to corrupt data structures in the driver, and a crash.
      
      Additionally, there is another, more critical race between netpoll and
      napi_disable.  The disabled napi state is actually identical to the scheduled
      state for a given napi instance.  The implication being that, if a napi instance
      is disabled, a netconsole instance would see the napi state of the device as
      having been scheduled, and poll it, likely while the driver was doing something
      requiring exclusive access.  In the case above, it's fairly clear that not having
      the rings in a state ready to be polled will cause any number of crashes.
      
      The fix should be pretty easy.  netpoll uses its own bit to indicate that
      the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
      We can just gate disabling on that bit as well as the sched bit.  That should
      prevent netpoll from conducting a napi poll if we convert its set bit to a
      test_and_set_bit operation to provide mutual exclusion.
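
      A simplified sketch of that interlock (existing bodies elided):

          /* netpoll: claim the instance atomically; if the bit was already
           * set, the napi is being disabled (or serviced) and must not be
           * polled. */
          static int poll_one_napi(struct napi_struct *napi, int budget)
          {
                  if (test_and_set_bit(NAPI_STATE_NPSVC, &napi->state))
                          return budget;
                  /* ... poll the instance ... */
                  clear_bit(NAPI_STATE_NPSVC, &napi->state);
                  return budget;
          }

          /* napi_disable: wait for both the scheduler bit and the netpoll
           * service bit, so the driver really has exclusive access
           * afterwards. */
          void napi_disable(struct napi_struct *n)
          {
                  might_sleep();
                  set_bit(NAPI_STATE_DISABLE, &n->state);
                  while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
                          msleep(1);
                  while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
                          msleep(1);
                  clear_bit(NAPI_STATE_DISABLE, &n->state);
          }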
      
      Change notes:
      V2)
      	Remove a trailing whitespace
      	Resubmit with proper subject prefix
      
      V3)
      	Clean up spacing nits
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jmaxwell@redhat.com
      Tested-by: jmaxwell@redhat.com
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 23 September 2015 (9 commits)
  8. 22 September 2015 (1 commit)
    • tcp: usec resolution SYN/ACK RTT · 0f1c28ae
      Authored by Yuchung Cheng
      Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
      RTT is often measured as 0ms or sometimes 1ms, which would affect
      RTT estimation and min RTT sampling used by some congestion controls.
      
      This patch improves SYN/ACK RTT to be usec resolution if platform
      supports it. While the timestamping of SYN/ACK is done in request
      sock, the RTT measurement is carefully arranged to avoid storing
      another u64 timestamp in tcp_sock.
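
      A conceptual sketch of the sampling, using the skb_mstamp helpers of
      this era (the wrapper names are illustrative, and the snt_synack field
      type is per this patch, not stable API):

          /* Stamp the request sock when the SYN/ACK goes out ... */
          static void synack_sent(struct request_sock *req)
          {
                  skb_mstamp_get(&tcp_rsk(req)->snt_synack);
          }

          /* ... and take a usec-resolution delta right before the request
           * sock is released, once the final ACK has been validated. */
          static u32 synack_rtt_us(const struct request_sock *req)
          {
                  struct skb_mstamp now;

                  skb_mstamp_get(&now);
                  return skb_mstamp_us_delta(&now, &tcp_rsk(req)->snt_synack);
          }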
      
      For regular handshake w/o SYNACK retransmission, the RTT is sampled
      right after the child socket is created and right before the request
      sock is released (tcp_check_req() in tcp_minisocks.c).
      
      For Fast Open the child socket is already created when the SYN/ACK was
      sent; the RTT is sampled in tcp_rcv_state_process() after processing
      the final ACK and right before the request socket is released.
      
      If the SYN/ACK was retransmitted or a SYN-cookie was used, we rely
      on TCP timestamps to measure the RTT. The sample is taken at the
      same place in tcp_rcv_state_process() after the timestamp values
      are validated in tcp_validate_incoming(). Note that we do not store
      TS echo value in request_sock for SYN-cookies, because the value
      is already stored in tp->rx_opt used by tcp_ack_update_rtt().
      
      One side benefit is that the RTT measurement now happens before
      initializing congestion control (of the passive side). Therefore
      the congestion control can use the SYN/ACK RTT.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 21 September 2015 (3 commits)
  10. 19 September 2015 (4 commits)
  11. 18 September 2015 (2 commits)
    • tcp: provide skb->hash to synack packets · 58d607d3
      Authored by Eric Dumazet
      In commit b73c3d0e ("net: Save TX flow hash in sock and set in skbuf
      on xmit"), Tom provided a l4 hash to most outgoing TCP packets.
      
      We'd like to provide one as well for SYNACK packets, so that all packets
      of a given flow share the same txhash, to later enable the bonding driver to
      also use skb->hash to perform slave selection.
      
      Note that a SYNACK retransmit shuffles the tx hash, as Tom did
      in commit 265f94ff ("net: Recompute sk_txhash on negative routing
      advice") for established sockets.
      
      This has the nice effect of making TCP flows resilient to some kinds of
      black holes, even at the connection-establishment phase.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: Pass net into okfn · 0c4b51f0
      Authored by Eric W. Biederman
      This is immediately motivated by the bridge code that chains functions that
      call into netfilter.  Without passing net into the okfns the bridge code would
      need to guess at the best way to derive the network namespace in which
      to process packets.
      
      As net is frequently one of the first things computed in continuation functions
      after netfilter has done its job, passing in the desired network namespace is in
      many cases a code simplification.
      
      To support this change the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
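
      A sketch of that shim, assuming the okfn prototype now takes a leading
      struct net while dst_output() does not yet:

          /* Adapt dst_output() to the okfn signature; the network namespace
           * is accepted but ignored for the time being. */
          static inline int dst_output_okfn(struct net *net, struct sock *sk,
                                            struct sk_buff *skb)
          {
                  return dst_output(sk, skb);
          }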
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>