1. 02 10月, 2015 2 次提交
    • G
      memcg: remove pcp_counter_lock · ef510194
      Greg Thelen 提交于
      Commit 733a572e ("memcg: make mem_cgroup_read_{stat|event}() iterate
      possible cpus instead of online") removed the last use of the per memcg
      pcp_counter_lock but forgot to remove the variable.
      
      Kill the vestigial variable.
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef510194
    • G
      memcg: fix dirty page migration · 0610c25d
      Greg Thelen 提交于
      The problem starts with a file backed dirty page which is charged to a
      memcg.  Then page migration is used to move oldpage to newpage.
      
      Migration:
       - copies the oldpage's data to newpage
       - clears oldpage.PG_dirty
       - sets newpage.PG_dirty
       - uncharges oldpage from memcg
       - charges newpage to memcg
      
      Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
      count.
      
      However, because newpage is not yet charged, setting newpage.PG_dirty
      does not increment the memcg's dirty page count.  After migration
      completes newpage.PG_dirty is eventually cleared, often in
      account_page_cleaned().  At this time newpage is charged to a memcg so
      the memcg's dirty page count is decremented which causes underflow
      because the count was not previously incremented by migration.  This
      underflow causes balance_dirty_pages() to see a very large unsigned
      number of dirty memcg pages which leads to aggressive throttling of
      buffered writes by processes in non root memcg.
      
      This issue:
       - can harm performance of non root memcg buffered writes.
       - can report too small (even negative) values in
         memory.stat[(total_)dirty] counters of all memcg, including the root.
      
      To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
      page_memcg() and set_page_memcg() helpers.
      
      Test:
          0) setup and enter limited memcg
          mkdir /sys/fs/cgroup/test
          echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/test/cgroup.procs
      
          1) buffered writes baseline
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          2) buffered writes with compaction antagonist to induce migration
          yes 1 > /proc/sys/vm/compact_memory &
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          kill %
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          3) buffered writes without antagonist, should match baseline
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
                             (speed, dirty residue)
                   unpatched                       patched
          1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
          2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
          3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
      
          Notice that unpatched baseline performance (1) fell after
          migration (3): 841 -> 114 MB/s.  In the patched kernel, post
          migration performance matches baseline.
      
      Fixes: c4843a75 ("memcg: add per cgroup dirty page accounting")
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0610c25d
  2. 01 10月, 2015 2 次提交
  3. 30 9月, 2015 2 次提交
    • P
      skbuff: Fix skb checksum partial check. · 31b33dfb
      Pravin B Shelar 提交于
      Earlier patch 6ae459bd tried to detect void ckecksum partial
      skb by comparing pull length to checksum offset. But it does
      not work for all cases since checksum-offset depends on
      updates to skb->data.
      
      Following patch fixes it by validating checksum start offset
      after skb-data pointer is updated. Negative value of checksum
      offset start means there is no need to checksum.
      
      Fixes: 6ae459bd ("skbuff: Fix skb checksum flag on skb pull")
      Reported-by: NAndrew Vagin <avagin@odin.com>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31b33dfb
    • A
      blk-mq: fix sysfs registration/unregistration race · 4593fdbe
      Akinobu Mita 提交于
      There is a race between cpu hotplug handling and adding/deleting
      gendisk for blk-mq, where both are trying to register and unregister
      the same sysfs entries.
      
      null_add_dev
          --> blk_mq_init_queue
              --> blk_mq_init_allocated_queue
                  --> add to 'all_q_list' (*)
          --> add_disk
              --> blk_register_queue
                  --> blk_mq_register_disk (++)
      
      null_del_dev
          --> del_gendisk
              --> blk_unregister_queue
                  --> blk_mq_unregister_disk (--)
          --> blk_cleanup_queue
              --> blk_mq_free_queue
                  --> del from 'all_q_list' (*)
      
      blk_mq_queue_reinit
          --> blk_mq_sysfs_unregister (-)
          --> blk_mq_sysfs_register (+)
      
      While the request queue is added to 'all_q_list' (*),
      blk_mq_queue_reinit() can be called for the queue anytime by CPU
      hotplug callback.  But blk_mq_sysfs_unregister (-) and
      blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
      before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--)
      is finished.  Because '/sys/block/*/mq/' is not exists.
      
      There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
      be used to track these sysfs stuff, but it is only fixing this issue
      partially.
      
      In order to fix it completely, we just need per-queue flag instead of
      per-hctx flag with appropriate locking.  So this introduces
      q->mq_sysfs_init_done which is properly protected with all_q_mutex.
      
      Also, we need to ensure that blk_mq_map_swqueue() is called with
      all_q_mutex is held.  Since hctx->nr_ctx is reset temporarily and
      updated in blk_mq_map_swqueue(), so we should avoid
      blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
      in CPU hotplug handling or adding/deleting gendisk .
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: NMing Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4593fdbe
  4. 26 9月, 2015 1 次提交
  5. 25 9月, 2015 5 次提交
    • S
      IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY · c6790aa9
      Sagi Grimberg 提交于
      Commit 96249d70 ("IB/core: Guarantee that a local_dma_lkey
      is available") allows ULPs that make use of the local dma key to keep
      working as before by allocating a DMA MR with local permissions and
      converted these consumers to use the MR associated with the PD
      rather then device->local_dma_lkey.
      
      ConnectIB has some known issues with memory registration
      using the local_dma_lkey (SEND, RDMA, RECV seems to work ok).
      
      Thus don't expose support for it (remove device->local_dma_lkey
      setting), and take advantage of the above commit such that no regression
      is introduced to working systems.
      
      The local_dma_lkey support will be restored in CX4 depending on FW
      capability query.
      Signed-off-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      c6790aa9
    • R
      phy: add phy_device_remove() · 38737e49
      Russell King 提交于
      Add a phy_device_remove() function to complement phy_device_register(),
      which undoes the effects of phy_device_register() by removing the phy
      device from visibility, but not freeing it.
      
      This allows these details to be moved out of the mdio bus code into
      the phy code where this action belongs.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38737e49
    • R
      phy: fix mdiobus module safety · 3e3aaf64
      Russell King 提交于
      Re-implement the mdiobus module refcounting to ensure that we actually
      ensure that the mdiobus module code does not go away while we might call
      into it.
      
      The old scheme using bus->dev.driver was buggy, because bus->dev is a
      class device which never has a struct device_driver associated with it,
      and hence the associated code trying to obtain a refcount did nothing
      useful.
      
      Instead, take the approach that other subsystems do: pass the module
      when calling mdiobus_register(), and record that in the mii_bus struct.
      When we need to increment the module use count in the phy code, use
      this stored pointer.  When the phy is deteched, drop the module
      refcount, remembering that the phy device might go away at that point.
      
      This doesn't stop the mii_bus going away while there are in-use phys -
      it merely stops the underlying code vanishing.
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e3aaf64
    • P
      skbuff: Fix skb checksum flag on skb pull · 6ae459bd
      Pravin B Shelar 提交于
      VXLAN device can receive skb with checksum partial. But the checksum
      offset could be in outer header which is pulled on receive. This results
      in negative checksum offset for the skb. Such skb can cause the assert
      failure in skb_checksum_help(). Following patch fixes the bug by setting
      checksum-none while pulling outer header.
      
      Following is the kernel panic msg from old kernel hitting the bug.
      
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:1906!
      RIP: 0010:[<ffffffff81518034>] skb_checksum_help+0x144/0x150
      Call Trace:
      <IRQ>
      [<ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch]
      [<ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch]
      [<ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
      [<ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
      [<ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
      [<ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch]
      [<ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
      [<ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0
      [<ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0
      [<ffffffff8157ba7a>] udp_rcv+0x1a/0x20
      [<ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280
      [<ffffffff81550128>] ip_local_deliver+0x88/0x90
      [<ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370
      [<ffffffff81550365>] ip_rcv+0x235/0x300
      [<ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620
      [<ffffffff8151c360>] netif_receive_skb+0x80/0x90
      [<ffffffff81459935>] virtnet_poll+0x555/0x6f0
      [<ffffffff8151cd04>] net_rx_action+0x134/0x290
      [<ffffffff810683d8>] __do_softirq+0xa8/0x210
      [<ffffffff8162fe6c>] call_softirq+0x1c/0x30
      [<ffffffff810161a5>] do_softirq+0x65/0xa0
      [<ffffffff810687be>] irq_exit+0x8e/0xb0
      [<ffffffff81630733>] do_IRQ+0x63/0xe0
      [<ffffffff81625f2e>] common_interrupt+0x6e/0x6e
      Reported-by: NAnupam Chanda <achanda@vmware.com>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Acked-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ae459bd
    • T
      cgroup, writeback: don't enable cgroup writeback on traditional hierarchies · 9badce00
      Tejun Heo 提交于
      inode_cgwb_enabled() gates cgroup writeback support.  If it returns
      true, each inode is attached to the corresponding memory domain which
      gets mapped to io domain.  It currently only tests whether the
      filesystem and bdi support cgroup writeback; however, cgroup writeback
      support doesn't work on traditional hierarchies and thus it should
      also test whether memcg and iocg are on the default hierarchy.
      
      This caused traditional hierarchy setups to hit the cgroup writeback
      path inadvertently and ended up creating separate writeback domains
      for each memcg and mapping them all to the root iocg uncovering a
      couple issues in the cgroup writeback path.
      
      cgroup writeback was never meant to be enabled on traditional
      hierarchies.  Make inode_cgwb_enabled() test whether both memcg and
      iocg are on the default hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NArtem Bityutskiy <dedekind1@gmail.com>
      Reported-by: NDexuan Cui <decui@microsoft.com>
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Link: http://lkml.kernel.org/g/f30d4a6aa8a546ff88f73021d026a453@SIXPR30MB031.064d.mgd.msft.net
      9badce00
  6. 24 9月, 2015 1 次提交
    • N
      netpoll: Close race condition between poll_one_napi and napi_disable · 2d8bff12
      Neil Horman 提交于
      Drivers might call napi_disable while not holding the napi instance poll_lock.
      In those instances, its possible for a race condition to exist between
      poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
      NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
      the following may happen:
      
      CPU0				CPU1
      ndo_tx_timeout			napi_poll_dev
       napi_disable			 poll_one_napi
        test_and_set_bit (ret 0)
      				  test_bit (ret 1)
         reset adapter		   napi_poll_routine
      
      If the adapter gets a tx timeout without a napi instance scheduled, its possible
      for the adapter to think it has exclusive access to the hardware  (as the napi
      instance is now scheduled via the napi_disable call), while the netpoll code
      thinks there is simply work to do.  The result is parallel hardware access
      leading to corrupt data structures in the driver, and a crash.
      
      Additionaly, there is another, more critical race between netpoll and
      napi_disable.  The disabled napi state is actually identical to the scheduled
      state for a given napi instance.  The implication being that, if a napi instance
      is disabled, a netconsole instance would see the napi state of the device as
      having been scheduled, and poll it, likely while the driver was dong something
      requiring exclusive access.  In the case above, its fairly clear that not having
      the rings in a state ready to be polled will cause any number of crashes.
      
      The fix should be pretty easy.  netpoll uses its own bit to indicate that that
      the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
      We can just gate disabling on that bit as well as the sched bit.  That should
      prevent netpoll from conducting a napi poll if we convert its set bit to a
      test_and_set_bit operation to provide mutual exclusion
      
      Change notes:
      V2)
      	Remove a trailing whtiespace
      	Resubmit with proper subject prefix
      
      V3)
      	Clean up spacing nits
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jmaxwell@redhat.com
      Tested-by: jmaxwell@redhat.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d8bff12
  7. 23 9月, 2015 1 次提交
  8. 21 9月, 2015 2 次提交
  9. 18 9月, 2015 3 次提交
  10. 17 9月, 2015 1 次提交
  11. 16 9月, 2015 11 次提交
  12. 15 9月, 2015 4 次提交
  13. 14 9月, 2015 3 次提交
  14. 13 9月, 2015 1 次提交
    • L
      blk: rq_data_dir() should not return a boolean · 10fbd36e
      Linus Torvalds 提交于
      rq_data_dir() returns either READ or WRITE (0 == READ, 1 == WRITE), not
      a boolean value.
      
      Now, admittedly the "!= 0" doesn't really change the value (0 stays as
      zero, 1 stays as one), but it's not only redundant, it confuses gcc, and
      causes gcc to warn about the construct
      
          switch (rq_data_dir(req)) {
              case READ:
                  ...
              case WRITE:
                  ...
      
      that we have in a few drivers.
      
      Now, the gcc warning is silly and stupid (it seems to warn not about the
      switch value having a different type from the case statements, but about
      _any_ boolean switch value), but in this case the code itself is silly
      and stupid too, so let's just change it, and get rid of warnings like
      this:
      
        drivers/block/hd.c: In function ‘hd_request’:
        drivers/block/hd.c:630:11: warning: switch condition has boolean value [-Wswitch-bool]
           switch (rq_data_dir(req)) {
      
      The odd '!= 0' came in when "cmd_flags" got turned into a "u64" in
      commit 5953316d ("block: make rq->cmd_flags be 64-bit") and is
      presumably because the old code (that just did a logical 'and' with 1)
      would then end up making the type of rq_data_dir() be u64 too.
      
      But if we want to retain the old regular integer type, let's just cast
      the result to 'int' rather than use that rather odd '!= 0'.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10fbd36e
  15. 12 9月, 2015 1 次提交
    • J
      fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void · 6798a8ca
      Joe Perches 提交于
      The seq_<foo> function return values were frequently misused.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      
      All uses of these return values have been removed, so convert the
      return types to void.
      
      Miscellanea:
      
      o Move seq_put_decimal_<type> and seq_escape prototypes closer the
        other seq_vprintf prototypes
      o Reorder seq_putc and seq_puts to return early on overflow
      o Add argument names to seq_vprintf and seq_printf
      o Update the seq_escape kernel-doc
      o Convert a couple of leading spaces to tabs in seq_escape
      Signed-off-by: NJoe Perches <joe@perches.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6798a8ca