1. 07 9月, 2013 1 次提交
  2. 28 8月, 2013 5 次提交
    • S
      ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem · 7d6e1f54
      Sha Zhengju 提交于
      Following we will begin to add memcg dirty page accounting around
      __set_page_dirty_{buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
      avoid exporting those details to filesystems.
      
      Since vfs set_page_dirty() should be called under page lock, here we don't need elaborate
      codes to handle racy anymore, and two WARN_ON() are added to detect such exceptions.
      Thanks very much for Sage and Yan Zheng's coaching!
      
      I tested it in a two server's ceph environment that one is client and the other is
      mds/osd/mon, and run the following fsx test from xfstests:
      
        ./fsx   1MB -N 50000 -p 10000 -l 1048576
        ./fsx  10MB -N 50000 -p 10000 -l 10485760
        ./fsx 100MB -N 50000 -p 10000 -l 104857600
      
      The fsx does lots of mmap-read/mmap-write/truncate operations and the tests completed
      successfully without triggering any of WARN_ON.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      7d6e1f54
    • M
      ceph: allow sync_read/write return partial successed size of read/write. · ee7289bf
      majianpeng 提交于
      For sync_read/write, it may do multi stripe operations.If one of those
      met erro, we return the former successed size rather than a error value.
      There is a exception for write-operation met -EOLDSNAPC.If this occur,we
      retry the whole write again.
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      ee7289bf
    • M
      ceph: fix bugs about handling short-read for sync read mode. · 02ae66d8
      majianpeng 提交于
      cephfs . show_layout
      >layyout.data_pool:     0
      >layout.object_size:   4194304
      >layout.stripe_unit:   4194304
      >layout.stripe_count:  1
      
      TestA:
      >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
      >dd if=/dev/urandom of=test bs=1M count=2 seek=4  oflag=direct
      >dd if=test of=/dev/null bs=6M count=1 iflag=direct
      The messages from func striped_read are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
      ceph:           file.c:350  : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
      ceph:           file.c:381  : zero tail 4194304
      ceph:           file.c:390  : striped_read returns 6291456
      The hole of file is from 2M--4M.But actualy it zero the last 4M include
      the last 2M area which isn't a hole.
      Using this patch, the messages are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
      ceph:           file.c:358  :  zero gap 2097152 to 4194304
      ceph:           file.c:350  : striped_read 4194304~2097152 (read 4194304) got 2097152
      ceph:           file.c:384  : striped_read returns 6291456
      
      TestB:
      >echo majianpeng > test
      >dd if=test of=/dev/null bs=2M count=1 iflag=direct
      The messages are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
      ceph:           file.c:350  : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
      ceph:           file.c:390  : striped_read returns 11
      For this case,it did once more striped_read.It's no meaningless.
      Using this patch, the message are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
      ceph:           file.c:384  : striped_read returns 11
      
      Big thanks to Yan Zheng for the patch.
      Reviewed-by: NYan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      02ae66d8
    • L
      ceph: remove useless variable revoked_rdcache · e9075743
      Li Wang 提交于
      Cleanup in handle_cap_grant().
      Signed-off-by: NLi Wang <liwang@ubuntukylin.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      e9075743
    • S
      ceph: fix fallocate division · b314a90d
      Sage Weil 提交于
      We need to use do_div to divide by a 64-bit value.
      Signed-off-by: NSage Weil <sage@inktank.com>
      Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
      b314a90d
  3. 16 8月, 2013 4 次提交
    • L
      ceph: punch hole support · ad7a60de
      Li Wang 提交于
      This patch implements fallocate and punch hole support for Ceph kernel client.
      Signed-off-by: NLi Wang <liwang@ubuntukylin.com>
      Signed-off-by: NYunchuan Wen <yunchuanwen@ubuntukylin.com>
      ad7a60de
    • Y
      ceph: fix request max size · 3871cbb9
      Yan, Zheng 提交于
      ceph_check_caps() requests new max size only when there is Fw cap.
      If we call check_max_size() while there is no Fw cap. It updates
      i_wanted_max_size and calls ceph_check_caps(), but ceph_check_caps()
      does nothing. Later when Fw cap is issued, we call check_max_size()
      again. But i_wanted_max_size is equal to 'endoff' at this time, so
      check_max_size() doesn't call ceph_check_caps() and we end up with
      waiting for the new max size forever.
      
      The fix is duplicate ceph_check_caps()'s "request max size" code in
      check_max_size(), and make try_get_cap_refs() wait for the Fw cap
      before retry requesting new max size.
      
      This patch also removes the "endoff > (inode->i_size << 1)" check
      in check_max_size(). It's useless because there is no corresponding
      logic in ceph_check_caps().
      Reviewed-by: NSage Weil <sage@inktank.com>
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      3871cbb9
    • Y
      ceph: introduce i_truncate_mutex · b0d7c223
      Yan, Zheng 提交于
      I encountered below deadlock when running fsstress
      
      wmtruncate work      truncate                 MDS
      ---------------  ------------------  --------------------------
                         lock i_mutex
                                            <- truncate file
      lock i_mutex (blocked)
                                            <- revoking Fcb (filelock to MIX)
                         send request ->
                                               handle request (xlock filelock)
      
      At the initial time, there are some dirty pages in the page cache.
      When the kclient receives the truncate message, it reduces inode size
      and creates some 'out of i_size' dirty pages. wmtruncate work can't
      truncate these dirty pages because it's blocked by the i_mutex. Later
      when the kclient receives the cap message that revokes Fcb caps, It
      can't flush all dirty pages because writepages() only flushes dirty
      pages within the inode size.
      
      When the MDS handles the 'truncate' request from kclient, it waits
      for the filelock to become stable. But the filelock is stuck in
      unstable state because it can't finish revoking kclient's Fcb caps.
      
      The truncate pagecache locking has already caused lots of trouble
      for use. I think it's time simplify it by introducing a new mutex.
      We use the new mutex to prevent concurrent truncate_inode_pages().
      There is no need to worry about race between buffered write and
      truncate_inode_pages(), because our "get caps" mechanism prevents
      them from concurrent execution.
      Reviewed-by: NSage Weil <sage@inktank.com>
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      b0d7c223
    • M
      ceph: cleanup the logic in ceph_invalidatepage · b150f5c1
      Milosz Tanski 提交于
      The invalidatepage code bails if it encounters a non-zero page offset. The
      current logic that does is non-obvious with multiple if statements.
      
      This should be logically and functionally equivalent.
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      b150f5c1
  4. 10 8月, 2013 12 次提交
  5. 05 7月, 2013 1 次提交
  6. 04 7月, 2013 15 次提交
    • Y
      ceph: fix race between cap issue and revoke · 6ee6b953
      Yan, Zheng 提交于
      If we receive new caps from the auth MDS and the non-auth MDS is
      revoking the newly issued caps, we should release the caps from
      the non-auth MDS. The scenario is filelock's state changes from
      SYNC to LOCK. Non-auth MDS revokes Fc cap, the client gets Fc cap
      from the auth MDS at the same time.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      6ee6b953
    • Y
      ceph: fix cap revoke race · b1530f57
      Yan, Zheng 提交于
      If caps are been revoking by the auth MDS, don't consider them as
      issued even they are still issued by non-auth MDS. The non-auth
      MDS should also be revoking/exporting these caps, the client just
      hasn't received the cap revoke/export message.
      
      The race I encountered is: When caps are exporting to new MDS, the
      client receives cap import message and cap revoke message from the
      new MDS, then receives cap export message from the old MDS. When
      the client receives cap revoke message from the new MDS, the revoking
      caps are still issued by the old MDS, so the client does nothing.
      Later when the cap export message is received, the client removes
      the caps issued by the old MDS. (Another way to fix the race is
      calling ceph_check_caps() in handle_cap_export())
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      b1530f57
    • Y
      ceph: fix pending vmtruncate race · b415bf4f
      Yan, Zheng 提交于
      The locking order for pending vmtruncate is wrong, it can lead to
      following race:
      
              write                  wmtruncate work
      ------------------------    ----------------------
      lock i_mutex
      check i_truncate_pending   check i_truncate_pending
      truncate_inode_pages()     lock i_mutex (blocked)
      copy data to page cache
      unlock i_mutex
                                 truncate_inode_pages()
      
      The fix is take i_mutex before calling __ceph_do_pending_vmtruncate()
      
      Fixes: http://tracker.ceph.com/issues/5453Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      b415bf4f
    • S
      ceph: avoid accessing invalid memory · 54464296
      Sasha Levin 提交于
      when mounting ceph with a dev name that starts with a slash, ceph
      would attempt to access the character before that slash. Since we
      don't actually own that byte of memory, we would trigger an
      invalid access:
      
      [   43.499934] BUG: unable to handle kernel paging request at ffff880fa3a97fff
      [   43.500984] IP: [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
      [   43.501491] PGD 743b067 PUD 10283c4067 PMD 10282a6067 PTE 8000000fa3a97060
      [   43.502301] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [   43.503006] Dumping ftrace buffer:
      [   43.503596]    (ftrace buffer empty)
      [   43.504046] CPU: 0 PID: 10879 Comm: mount Tainted: G        W    3.10.0-sasha #1129
      [   43.504851] task: ffff880fa625b000 ti: ffff880fa3412000 task.ti: ffff880fa3412000
      [   43.505608] RIP: 0010:[<ffffffff818f3884>]  [<ffffffff818f3884>] parse_mount_options$
      [   43.506552] RSP: 0018:ffff880fa3413d08  EFLAGS: 00010286
      [   43.507133] RAX: ffff880fa3a98000 RBX: ffff880fa3a98000 RCX: 0000000000000000
      [   43.507893] RDX: ffff880fa3a98001 RSI: 000000000000002f RDI: ffff880fa3a98000
      [   43.508610] RBP: ffff880fa3413d58 R08: 0000000000001f99 R09: ffff880fa3fe64c0
      [   43.509426] R10: ffff880fa3413d98 R11: ffff880fa38710d8 R12: ffff880fa3413da0
      [   43.509792] R13: ffff880fa3a97fff R14: 0000000000000000 R15: ffff880fa3413d90
      [   43.509792] FS:  00007fa9c48757e0(0000) GS:ffff880fd2600000(0000) knlGS:000000000000$
      [   43.509792] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   43.509792] CR2: ffff880fa3a97fff CR3: 0000000fa3bb9000 CR4: 00000000000006b0
      [   43.509792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   43.509792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [   43.509792] Stack:
      [   43.509792]  0000e5180000000e ffffffff85ca1900 ffff880fa38710d8 ffff880fa3413d98
      [   43.509792]  0000000000000120 0000000000000000 ffff880fa3a98000 0000000000000000
      [   43.509792]  ffffffff85cf32a0 0000000000000000 ffff880fa3413dc8 ffffffff818f3c72
      [   43.509792] Call Trace:
      [   43.509792]  [<ffffffff818f3c72>] ceph_mount+0xa2/0x390
      [   43.509792]  [<ffffffff81226314>] ? pcpu_alloc+0x334/0x3c0
      [   43.509792]  [<ffffffff81282f8d>] mount_fs+0x8d/0x1a0
      [   43.509792]  [<ffffffff812263d0>] ? __alloc_percpu+0x10/0x20
      [   43.509792]  [<ffffffff8129f799>] vfs_kern_mount+0x79/0x100
      [   43.509792]  [<ffffffff812a224d>] do_new_mount+0xcd/0x1c0
      [   43.509792]  [<ffffffff812a2e8d>] do_mount+0x15d/0x210
      [   43.509792]  [<ffffffff81220e55>] ? strndup_user+0x45/0x60
      [   43.509792]  [<ffffffff812a2fdd>] SyS_mount+0x9d/0xe0
      [   43.509792]  [<ffffffff83fd816c>] tracesys+0xdd/0xe2
      [   43.509792] Code: 4c 8b 5d c0 74 0a 48 8d 50 01 49 89 14 24 eb 17 31 c0 48 83 c9 ff $
      [   43.509792] RIP  [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300
      [   43.509792]  RSP <ffff880fa3413d08>
      [   43.509792] CR2: ffff880fa3a97fff
      [   43.509792] ---[ end trace 22469cd81e93af51 ]---
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: NSage Weil <sage@inktan.com>
      54464296
    • M
      ceph: Reconstruct the func ceph_reserve_caps. · 93faca6e
      majianpeng 提交于
      Drop ignored return value.  Fix allocation failure case to not leak.
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      93faca6e
    • M
      ceph: Free mdsc if alloc mdsc->mdsmap failed. · fb3101b6
      majianpeng 提交于
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      fb3101b6
    • J
      ceph: remove sb_start/end_write in ceph_aio_write. · 0405a149
      Jianpeng Ma 提交于
      Either in vfs_write or io_submit,it call file_start/end_write.
      The different between file_start/end_write and sb_start/end_write is
      file_ only handle regular file.But i think in ceph_aio_write,it only
      for regular file.
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Acked-by: NYan, Zheng <zheng.z.yan@intel.com>
      0405a149
    • M
    • M
      ceph: fix sleeping function called from invalid context. · a1dc1937
      majianpeng 提交于
      [ 1121.231883] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
      [ 1121.231935] in_atomic(): 1, irqs_disabled(): 0, pid: 9831, name: mv
      [ 1121.231971] 1 lock held by mv/9831:
      [ 1121.231973]  #0:  (&(&ci->i_ceph_lock)->rlock){+.+...},at:[<ffffffffa02bbd38>] ceph_getxattr+0x58/0x1d0 [ceph]
      [ 1121.231998] CPU: 3 PID: 9831 Comm: mv Not tainted 3.10.0-rc6+ #215
      [ 1121.232000] Hardware name: To Be Filled By O.E.M. To Be Filled By
      O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [ 1121.232027]  ffff88006d355a80 ffff880092f69ce0 ffffffff8168348c ffff880092f69cf8
      [ 1121.232045]  ffffffff81070435 ffff88006d355a20 ffff880092f69d20 ffffffff816899ba
      [ 1121.232052]  0000000300000004 ffff8800b76911d0 ffff88006d355a20 ffff880092f69d68
      [ 1121.232056] Call Trace:
      [ 1121.232062]  [<ffffffff8168348c>] dump_stack+0x19/0x1b
      [ 1121.232067]  [<ffffffff81070435>] __might_sleep+0xe5/0x110
      [ 1121.232071]  [<ffffffff816899ba>] down_read+0x2a/0x98
      [ 1121.232080]  [<ffffffffa02baf70>] ceph_vxattrcb_layout+0x60/0xf0 [ceph]
      [ 1121.232088]  [<ffffffffa02bbd7f>] ceph_getxattr+0x9f/0x1d0 [ceph]
      [ 1121.232093]  [<ffffffff81188d28>] vfs_getxattr+0xa8/0xd0
      [ 1121.232097]  [<ffffffff8118900b>] getxattr+0xab/0x1c0
      [ 1121.232100]  [<ffffffff811704f2>] ? final_putname+0x22/0x50
      [ 1121.232104]  [<ffffffff81155f80>] ? kmem_cache_free+0xb0/0x260
      [ 1121.232107]  [<ffffffff811704f2>] ? final_putname+0x22/0x50
      [ 1121.232110]  [<ffffffff8109e63d>] ? trace_hardirqs_on+0xd/0x10
      [ 1121.232114]  [<ffffffff816957a7>] ? sysret_check+0x1b/0x56
      [ 1121.232120]  [<ffffffff81189c9c>] SyS_fgetxattr+0x6c/0xc0
      [ 1121.232125]  [<ffffffff81695782>] system_call_fastpath+0x16/0x1b
      [ 1121.232129] BUG: scheduling while atomic: mv/9831/0x10000002
      [ 1121.232154] 1 lock held by mv/9831:
      [ 1121.232156]  #0:  (&(&ci->i_ceph_lock)->rlock){+.+...}, at:
      [<ffffffffa02bbd38>] ceph_getxattr+0x58/0x1d0 [ceph]
      
      I think move the ci->i_ceph_lock down is safe because we can't free
      ceph_inode_info at there.
      
      CC: stable@vger.kernel.org  # 3.8+
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      a1dc1937
    • Y
      005c4697
    • Y
      ceph: clear migrate seq when MDS restarts · 667ca05c
      Yan, Zheng 提交于
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      667ca05c
    • Y
      ceph: check migrate seq before changing auth cap · b8c2f3ae
      Yan, Zheng 提交于
      We may receive old request reply from the exporter MDS after receiving
      the importer MDS' cap import message.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      b8c2f3ae
    • Y
      ceph: fix race between page writeback and truncate · fc2744aa
      Yan, Zheng 提交于
      The client can receive truncate request from MDS at any time.
      So the page writeback code need to get i_size, truncate_seq and
      truncate_size atomically
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      fc2744aa
    • Y
      3803da49
    • Y
      ceph: fix cap release race · bb137f84
      Yan, Zheng 提交于
      ceph_encode_inode_release() can race with ceph_open() and release
      caps wanted by open files. So it should call __ceph_caps_wanted()
      to get the wanted caps.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      bb137f84
  7. 03 7月, 2013 1 次提交
    • J
      vfs: export lseek_execute() to modules · 46a1c2c7
      Jie Liu 提交于
      For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
      SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
      matter in lseek_execute() to update the current file offset
      to the desired offset if it is valid, ceph also does the
      simliar things at ceph_llseek().
      
      To reduce the duplications, this patch make lseek_execute()
      public accessible so that we can call it directly from the
      underlying file systems.
      
      Thanks Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      46a1c2c7
  8. 02 7月, 2013 1 次提交