1. 02 6月, 2015 11 次提交
    • T
      memcg: implement mem_cgroup_css_from_page() · ad7fa852
      Tejun Heo 提交于
      Implement mem_cgroup_css_from_page() which returns the
      cgroup_subsys_state of the memcg associated with a given page on the
      default hierarchy.  This will be used by cgroup writeback support.
      
      This function assumes that page->mem_cgroup association doesn't change
      until the page is released, which is true on the default hierarchy as
      long as replace_page_cache_page() is not used.  As the only user of
      replace_page_cache_page() is FUSE which won't support cgroup writeback
      for the time being, this works for now, and replace_page_cache_page()
      will soon be updated so that the invariant actually holds.
      
      Note that the RCU protected page->mem_cgroup access is consistent with
      other usages across memcg but ultimately incorrect.  These unlocked
      accesses are missing required barriers.  page->mem_cgroup should be
      made an RCU pointer and updated and accessed using RCU operations.
      
      v4: Instead of triggering WARN, return the root css on the traditional
          hierarchies.  This makes the function a lot easier to deal with
          especially as there's no light way to synchronize against
          hierarchy rebinding.
      
      v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/
      
      v2: Trigger WARN if the function is used on the traditional
          hierarchies and add comment about the assumed invariant.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ad7fa852
    • T
      blkcg: implement bio_associate_blkcg() · 1d933cf0
      Tejun Heo 提交于
      Currently, a bio can only be associated with the io_context and blkcg
      of %current using bio_associate_current().  This is too restrictive
      for cgroup writeback support.  Implement bio_associate_blkcg() which
      associates a bio with the specified blkcg.
      
      bio_associate_blkcg() leaves the io_context unassociated.
      bio_associate_current() is updated so that it considers a bio as
      already associated if it has a blkcg_css, instead of an io_context,
      associated with it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1d933cf0
    • T
      blkcg: implement task_get_blkcg_css() · fd383c2d
      Tejun Heo 提交于
      Implement a wrapper around task_get_css() to acquire the blkcg css for
      a given task.  The wrapper is necessary for cgroup writeback support
      as there will be places outside blkcg proper trying to acquire
      blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fd383c2d
    • T
      cgroup, block: implement task_get_css() and use it in bio_associate_current() · ec438699
      Tejun Heo 提交于
      bio_associate_current() currently open codes task_css() and
      css_tryget_online() to find and pin $current's blkcg css.  Abstract it
      into task_get_css() which is implemented from cgroup side.  As a task
      is always associated with an online css for every subsystem except
      while the css_set update is propagating, task_get_css() retries till
      css_tryget_online() succeeds.
      
      This is a cleanup and shouldn't lead to noticeable behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ec438699
    • T
      blkcg: add blkcg_root_css · 496d5e75
      Tejun Heo 提交于
      Add global constant blkcg_root_css which points to &blkcg_root.css.
      This will be used by cgroup writeback support.  If blkcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      
      v2: The declarations moved to include/linux/blk-cgroup.h as suggested
          by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      496d5e75
    • T
      memcg: add mem_cgroup_root_css · 56161634
      Tejun Heo 提交于
      Add global mem_cgroup_root_css which points to the root memcg css.
      This will be used by cgroup writeback support.  If memcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      aCc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      56161634
    • T
      update !CONFIG_BLK_CGROUP dummies in include/linux/blk-cgroup.h · efa7d1c7
      Tejun Heo 提交于
      The header file will be used more widely with the pending cgroup
      writeback support and the current set of dummy declarations aren't
      enough to handle different config combinations.  Update as follows.
      
      * Drop the struct cgroup declaration.  None of the dummy defs need it.
      
      * Define blkcg as an empty struct instead of just declaring it.
      
      * Wrap dummy function defs in CONFIG_BLOCK.  Some functions use block
        data types and none of them are to be used w/o block enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      efa7d1c7
    • T
      blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h · eea8f41c
      Tejun Heo 提交于
      cgroup aware writeback support will require exposing some of blkcg
      details.  In preprataion, move block/blk-cgroup.h to
      include/linux/blk-cgroup.h.  This patch is pure file move.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      eea8f41c
    • G
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen 提交于
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: NSha Zhengju <handai.szj@gmail.com>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c4843a75
    • T
      page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Tejun Heo 提交于
      cancel_dirty_page() had some issues and b9ea2515 ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned() which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support requires synchronization between dirty bit manipulation and
      stat updates.  While we can open-code such synchronization in each
      account_page_cleaned() callsite, that's gonna be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      11f81bec
    • K
      blk-mq: Shared tag enhancements · f26cdc85
      Keith Busch 提交于
      Storage controllers may expose multiple block devices that share hardware
      resources managed by blk-mq. This patch enhances the shared tags so a
      low-level driver can access the shared resources not tied to the unshared
      h/w contexts. This way the LLD can dynamically add and delete disks and
      request queues without having to track all the request_queue hctx's to
      iterate outstanding tags.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f26cdc85
  2. 30 5月, 2015 1 次提交
  3. 22 5月, 2015 2 次提交
    • C
      block, dm: don't copy bios for request clones · 5f1b670d
      Christoph Hellwig 提交于
      Currently dm-multipath has to clone the bios for every request sent
      to the lower devices, which wastes cpu cycles and ties down memory.
      
      This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
      to not complete bios attached to a request, which we set on clone
      requests similar to bios in a flush sequence.  With this change I/O
      errors on a path failure only get propagated to dm-multipath, which
      can then either resubmit the I/O or complete the bios on the original
      request.
      
      I've done some basic testing of this on a Linux target with ALUA support,
      and it survives path failures during I/O nicely.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5f1b670d
    • M
      block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb
      Mike Snitzer 提交于
      Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
      non-chains") regressed all existing callers that followed this pattern:
       1) saving a bio's original bi_end_io
       2) wiring up an intermediate bi_end_io
       3) restoring the original bi_end_io from intermediate bi_end_io
       4) calling bio_endio() to execute the restored original bi_end_io
      
      The regression was due to BIO_CHAIN only ever getting set if
      bio_inc_remaining() is called.  For the above pattern it isn't set until
      step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
      the first bio_endio(), in step 2 above, never decremented __bi_remaining
      before calling the intermediate bi_end_io -- leaving __bi_remaining with
      the value 1 instead of 0.  When bio_inc_remaining() occurred during step
      3 it brought it to a value of 2.  When the second bio_endio() was
      called, in step 4 above, it should've called the original bi_end_io but
      it didn't because there was an extra reference that wasn't dropped (due
      to atomic operations being optimized away since BIO_CHAIN wasn't set
      upfront).
      
      Fix this issue by removing the __bi_remaining management complexity for
      all callers that use the above pattern -- bio_chain() is the only
      interface that _needs_ to be concerned with __bi_remaining.  For the
      above pattern callers just expect the bi_end_io they set to get called!
      Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
      that aren't associated with the bio_chain() interface.
      
      Also, the bio_inc_remaining() interface has been moved local to bio.c.
      
      Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      326e1dbb
  4. 20 5月, 2015 1 次提交
    • J
      block: export blkdev_reread_part() and __blkdev_reread_part() · be324177
      Jarod Wilson 提交于
      This patch exports blkdev_reread_part() for block drivers, also
      introduce __blkdev_reread_part().
      
      For some drivers, such as loop, reread of partitions can be run
      from the release path, and bd_mutex may already be held prior to
      calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
      __blkdev_reread_part for use in such cases.
      
      CC: Christoph Hellwig <hch@lst.de>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Tejun Heo <tj@kernel.org>
      CC: Alexander Viro <viro@zeniv.linux.org.uk>
      CC: Markus Pargmann <mpa@pengutronix.de>
      CC: Stefan Weinhuber <wein@de.ibm.com>
      CC: Stefan Haberland <stefan.haberland@de.ibm.com>
      CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
      CC: Fabian Frederick <fabf@skynet.be>
      CC: Ming Lei <ming.lei@canonical.com>
      CC: David Herrmann <dh.herrmann@gmail.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: nbd-general@lists.sourceforge.net
      CC: linux-s390@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      be324177
  5. 19 5月, 2015 5 次提交
  6. 06 5月, 2015 8 次提交
  7. 04 5月, 2015 1 次提交
    • D
      lib: make memzero_explicit more robust against dead store elimination · 7829fb09
      Daniel Borkmann 提交于
      In commit 0b053c95 ("lib: memzero_explicit: use barrier instead
      of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
      case LTO would decide to inline memzero_explicit() and eventually
      find out it could be elimiated as dead store.
      
      While using barrier() works well for the case of gcc, recent efforts
      from LLVMLinux people suggest to use llvm as an alternative to gcc,
      and there, Stephan found in a simple stand-alone user space example
      that llvm could nevertheless optimize and thus elimitate the memset().
      A similar issue has been observed in the referenced llvm bug report,
      which is regarded as not-a-bug.
      
      Based on some experiments, icc is a bit special on its own, while it
      doesn't seem to eliminate the memset(), it could do so with an own
      implementation, and then result in similar findings as with llvm.
      
      The fix in this patch now works for all three compilers (also tested
      with more aggressive optimization levels). Arguably, in the current
      kernel tree it's more of a theoretical issue, but imho, it's better
      to be pedantic about it.
      
      It's clearly visible with gcc/llvm though, with the below code: if we
      would have used barrier() only here, llvm would have omitted clearing,
      not so with barrier_data() variant:
      
        static inline void memzero_explicit(void *s, size_t count)
        {
          memset(s, 0, count);
          barrier_data(s);
        }
      
        int main(void)
        {
          char buff[20];
          memzero_explicit(buff, sizeof(buff));
          return 0;
        }
      
        $ gcc -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x0000000000400400  <+0>: lea   -0x28(%rsp),%rax
         0x0000000000400405  <+5>: movq  $0x0,-0x28(%rsp)
         0x000000000040040e <+14>: movq  $0x0,-0x20(%rsp)
         0x0000000000400417 <+23>: movl  $0x0,-0x18(%rsp)
         0x000000000040041f <+31>: xor   %eax,%eax
         0x0000000000400421 <+33>: retq
        End of assembler dump.
      
        $ clang -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x00000000004004f0  <+0>: xorps  %xmm0,%xmm0
         0x00000000004004f3  <+3>: movaps %xmm0,-0x18(%rsp)
         0x00000000004004f8  <+8>: movl   $0x0,-0x8(%rsp)
         0x0000000000400500 <+16>: lea    -0x18(%rsp),%rax
         0x0000000000400505 <+21>: xor    %eax,%eax
         0x0000000000400507 <+23>: retq
        End of assembler dump.
      
      As gcc, clang, but also icc defines __GNUC__, it's sufficient to define
      this in compiler-gcc.h only to be picked up. For a fallback or otherwise
      unsupported compiler, we define it as a barrier. Similarly, for ecc which
      does not support gcc inline asm.
      
      Reference: https://llvm.org/bugs/show_bug.cgi?id=15495Reported-by: NStephan Mueller <smueller@chronox.de>
      Tested-by: NStephan Mueller <smueller@chronox.de>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Stephan Mueller <smueller@chronox.de>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: mancha security <mancha1@zoho.com>
      Cc: Mark Charlebois <charlebm@gmail.com>
      Cc: Behan Webster <behanw@converseincode.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      7829fb09
  8. 02 5月, 2015 1 次提交
  9. 30 4月, 2015 1 次提交
    • N
      bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Nicolas Dichtel 提交于
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
      it is sent only at the end of a dump.
      
      Libraries like libnl will wait forever for NLMSG_DONE.
      
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46c264da
  10. 29 4月, 2015 2 次提交
  11. 28 4月, 2015 4 次提交
    • R
      ASoC: Update email-id of Rajeev Kumar · 9d7dd6cd
      Rajeev Kumar 提交于
      rajeev-dlh.kumar@st.com email-id doesn't exist anymore as I have left the
      company.  Replace ST's id with Rajeev Kumar <rajeevkumar.linux@gmail.com>
      Signed-off-by: NRajeev Kumar <rajeevkumar.linux@gmail.com>
      Signed-off-by: NMark Brown <broonie@kernel.org>
      9d7dd6cd
    • F
      tty: Re-add external interface for tty_set_termios() · b00f5c2d
      Frederic Danis 提交于
      This is needed by Bluetooth hci_uart module to be able to change speed
      of Bluetooth controller and local UART.
      Signed-off-by: NFrederic Danis <frederic.danis@linux.intel.com>
      Reviewed-by: NPeter Hurley <peter@hurleysoftware.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b00f5c2d
    • H
      uas: Add US_FL_MAX_SECTORS_240 flag · ee136af4
      Hans de Goede 提交于
      The usb-storage driver sets max_sectors = 240 in its scsi-host template,
      for uas we do not want to do that for all devices, but testing has shown
      that some devices need it.
      
      This commit adds a US_FL_MAX_SECTORS_240 flag for such devices, and
      implements support for it in uas.c, while at it it also adds support
      for US_FL_MAX_SECTORS_64 to uas.c.
      
      Cc: stable@vger.kernel.org # 3.16
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Acked-by: NAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee136af4
    • M
      SCSI: add 1024 max sectors black list flag · 35e9a9f9
      Mike Christie 提交于
      This works around a issue with qnap iscsi targets not handling large IOs
      very well.
      
      The target returns:
      
      VPD INQUIRY: Block limits page (SBC)
        Maximum compare and write length: 1 blocks
        Optimal transfer length granularity: 1 blocks
        Maximum transfer length: 4294967295 blocks
        Optimal transfer length: 4294967295 blocks
        Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
        Maximum unmap LBA count: 8388607
        Maximum unmap block descriptor count: 1
        Optimal unmap granularity: 16383
        Unmap granularity alignment valid: 0
        Unmap granularity alignment: 0
        Maximum write same length: 0xffffffff blocks
        Maximum atomic transfer length: 0
        Atomic alignment: 0
        Atomic transfer length granularity: 0
      
      and it is *sometimes* able to handle at least one IO of size up to 8 MB. We
      have seen in traces where it will sometimes work, but other times it
      looks like it fails and it looks like it returns failures if we send
      multiple large IOs sometimes. Also it looks like it can return 2 different
      errors. It will sometimes send iscsi reject errors indicating out of
      resources or it will send invalid cdb illegal requests check conditions.
      And then when it sends iscsi rejects it does not seem to handle retries
      when there are command sequence holes, so I could not just add code to
      try and gracefully handle that error code.
      
      The problem is that we do not have a good contact for the company,
      so we are not able to determine under what conditions it returns
      which error and why it sometimes works.
      
      So, this patch just adds a new black list flag to set targets like this to
      the old max safe sectors of 1024. The max_hw_sectors changes added in 3.19
      caused this regression, so I also ccing stable.
      Reported-by: NChristian Hesse <list@eworm.de>
      Signed-off-by: NMike Christie <michaelc@cs.wisc.edu>
      Cc: stable@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJames Bottomley <JBottomley@Odin.com>
      35e9a9f9
  12. 27 4月, 2015 2 次提交
  13. 26 4月, 2015 1 次提交
    • E
      net: fix crash in build_skb() · 2ea2f62c
      Eric Dumazet 提交于
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ea2f62c