1. 16 7月, 2017 1 次提交
    • B
      fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks · 9d5b86ac
      Benjamin Coddington 提交于
      Since commit c69899a1 "NFSv4: Update of VFS byte range lock must be
      atomic with the stateid update", NFSv4 has been inserting locks in rpciod
      worker context.  The result is that the file_lock's fl_nspid is the
      kworker's pid instead of the original userspace pid.
      
      The fl_nspid is only used to represent the namespaced virtual pid number
      when displaying locks or returning from F_GETLK.  There's no reason to set
      it for every inserted lock, since we can usually just look it up from
      fl_pid.  So, instead of looking up and holding struct pid for every lock,
      let's just look up the virtual pid number from fl_pid when it is needed.
      That means we can remove fl_nspid entirely.
      
      The translaton and presentation of fl_pid should handle the following four
      cases:
      
      1 - F_GETLK on a remote file with a remote lock:
          In this case, the filesystem should determine the l_pid to return here.
          Filesystems should indicate that the fl_pid represents a non-local pid
          value that should not be translated by returning an fl_pid <= 0.
      
      2 - F_GETLK on a local file with a remote lock:
          This should be the l_pid of the lock manager process, and translated.
      
      3 - F_GETLK on a remote file with a local lock, and
      4 - F_GETLK on a local file with a local lock:
          These should be the translated l_pid of the local locking process.
      
      Fuse was already doing the correct thing by translating the pid into the
      caller's namespace.  With this change we must update fuse to translate
      to init's pid namespace, so that the locks API can then translate from
      init's pid namespace into the pid namespace of the caller.
      
      With this change, the locks API will expect that if a filesystem returns
      a remote pid as opposed to a local pid for F_GETLK, that remote pid will
      be <= 0.  This signifies that the pid is remote, and the locks API will
      forego translating that pid into the pid namespace of the local calling
      process.
      
      Finally, we convert remote filesystems to present remote pids using
      negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
      pid returned for F_GETLK lock requests.
      
      Since local pids will never be larger than PID_MAX_LIMIT (which is
      currently defined as <= 4 million), but pid_t is an unsigned int, we
      should have plenty of room to represent remote pids with negative
      numbers if we assume that remote pid numbers are similarly limited.
      
      If this is not the case, then we run the risk of having a remote pid
      returned for which there is also a corresponding local pid.  This is a
      problem we have now, but this patch should reduce the chances of that
      occurring, while also returning those remote pid numbers, for whatever
      that may be worth.
      Signed-off-by: NBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      9d5b86ac
  2. 21 4月, 2017 1 次提交
  3. 18 4月, 2017 2 次提交
  4. 25 2月, 2017 1 次提交
  5. 23 2月, 2017 3 次提交
  6. 15 11月, 2016 1 次提交
  7. 01 10月, 2016 2 次提交
  8. 22 9月, 2016 1 次提交
  9. 25 8月, 2016 1 次提交
  10. 29 7月, 2016 4 次提交
  11. 30 6月, 2016 1 次提交
    • A
      fuse: improve aio directIO write performance for size extending writes · 7879c4e5
      Ashish Sangwan 提交于
      While sending the blocking directIO in fuse, the write request is broken
      into sub-requests, each of default size 128k and all the requests are sent
      in non-blocking background mode if async_dio mode is supported by libfuse.
      The process which issue the write wait for the completion of all the
      sub-requests. Sending multiple requests parallely gives a chance to perform
      parallel writes in the user space fuse implementation if it is
      multi-threaded and hence improves the performance.
      
      When there is a size extending aio dio write, we switch to blocking mode so
      that we can properly update the size of the file after completion of the
      writes. However, in this situation all the sub-requests are sent in
      serialized manner where the next request is sent only after receiving the
      reply of the current request. Hence the multi-threaded user space
      implementation is not utilized properly.
      
      This patch changes the size extending aio dio behavior to exactly follow
      blocking dio. For multi threaded fuse implementation having 10 threads and
      using buffer size of 64MB to perform async directIO, we are getting double
      the speed.
      Signed-off-by: NAshish Sangwan <ashishsangwan2@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      7879c4e5
  12. 02 5月, 2016 2 次提交
  13. 25 4月, 2016 1 次提交
  14. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  15. 16 3月, 2016 1 次提交
  16. 14 3月, 2016 2 次提交
    • S
      fuse: Add reference counting for fuse_io_priv · 744742d6
      Seth Forshee 提交于
      The 'reqs' member of fuse_io_priv serves two purposes. First is to track
      the number of oustanding async requests to the server and to signal that
      the io request is completed. The second is to be a reference count on the
      structure to know when it can be freed.
      
      For sync io requests these purposes can be at odds.  fuse_direct_IO() wants
      to block until the request is done, and since the signal is sent when
      'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to
      use the object after the userspace server has completed processing
      requests. This leads to some handshaking and special casing that it
      needlessly complicated and responsible for at least one race condition.
      
      It's much cleaner and safer to maintain a separate reference count for the
      object lifecycle and to let 'reqs' just be a count of outstanding requests
      to the userspace server. Then we can know for sure when it is safe to free
      the object without any handshaking or special cases.
      
      The catch here is that most of the time these objects are stack allocated
      and should not be freed. Initializing these objects with a single reference
      that is never released prevents accidental attempts to free the objects.
      
      Fixes: 9d5722b7 ("fuse: handle synchronous iocbs internally")
      Cc: stable@vger.kernel.org # v4.1+
      Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      744742d6
    • R
      fuse: do not use iocb after it may have been freed · 7cabc61e
      Robert Doebbelin 提交于
      There's a race in fuse_direct_IO(), whereby is_sync_kiocb() is called on an
      iocb that could have been freed if async io has already completed.  The fix
      in this case is simple and obvious: cache the result before starting io.
      
      It was discovered by KASan:
      
      kernel: ==================================================================
      kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390
      Signed-off-by: NRobert Doebbelin <robert@quobyte.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: bcba24cc ("fuse: enable asynchronous processing direct IO")
      Cc: <stable@vger.kernel.org> # 3.10+
      7cabc61e
  17. 23 1月, 2016 1 次提交
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  18. 10 11月, 2015 2 次提交
    • R
      fuse: add support for SEEK_HOLE and SEEK_DATA in lseek · 0b5da8db
      Ravishankar N 提交于
      A useful performance improvement for accessing virtual machine images
      via FUSE mount.
      
      See https://bugzilla.redhat.com/show_bug.cgi?id=1220173 for a use-case
      for glusterFS.
      Signed-off-by: NRavishankar N <ravishankar@redhat.com>
      Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
      0b5da8db
    • R
      fuse: break infinite loop in fuse_fill_write_pages() · 3ca8138f
      Roman Gushchin 提交于
      I got a report about unkillable task eating CPU. Further
      investigation shows, that the problem is in the fuse_fill_write_pages()
      function. If iov's first segment has zero length, we get an infinite
      loop, because we never reach iov_iter_advance() call.
      
      Fix this by calling iov_iter_advance() before repeating an attempt to
      copy data from userspace.
      
      A similar problem is described in 124d3b70 ("fix writev regression:
      pan hanging unkillable and un-straceable"). If zero-length segmend
      is followed by segment with invalid address,
      iov_iter_fault_in_readable() checks only first segment (zero-length),
      iov_iter_copy_from_user_atomic() skips it, fails at second and
      returns zero -> goto again without skipping zero-length segment.
      
      Patch calls iov_iter_advance() before goto again: we'll skip zero-length
      segment at second iteraction and iov_iter_fault_in_readable() will detect
      invalid address.
      
      Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
      description.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Maxim Patlasov <mpatlasov@parallels.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
      Fixes: ea9b9907 ("fuse: implement perform_write")
      Cc: <stable@vger.kernel.org>
      3ca8138f
  19. 23 10月, 2015 1 次提交
  20. 01 7月, 2015 3 次提交
  21. 24 6月, 2015 1 次提交
  22. 02 6月, 2015 1 次提交
    • T
      writeback: move backing_dev_info->bdi_stat[] into bdi_writeback · 93f78d88
      Tejun Heo 提交于
      Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->bdi_stat[] into wb.
      
      * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
        enums is changed from BDI_ to WB_.
      
      * BDI_STAT_BATCH() -> WB_STAT_BATCH()
      
      * [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)
      
      * bdi_stat[_error]() -> wb_stat[_error]()
      
      * bdi_writeout_inc() -> wb_writeout_inc()
      
      * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
        frees stat.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      93f78d88
  23. 12 4月, 2015 6 次提交