1. 13 Apr, 2016 1 commit
    • debugfs: prevent access to possibly dead file_operations at file open · 9fd4dcec
      Nicolai Stange committed
      Nothing prevents a dentry found by path lookup before a return of
      __debugfs_remove() from actually getting opened after that return. Now,
      after the return of __debugfs_remove(), there are no guarantees whatsoever
      regarding the memory the corresponding inode's file_operations object
      had been kept in.
      
      Since __debugfs_remove() is seldom invoked, usually from module exit
      handlers only, the race is hard to trigger and the impact is very low.
      
      A discussion of the problem outlined above as well as a suggested
      solution can be found in the (sub-)thread rooted at
      
        http://lkml.kernel.org/g/20130401203445.GA20862@ZenIV.linux.org.uk
        ("Yet another pipe related oops.")
      
      Basically, Greg KH suggests introducing an intermediate fops and
      Al Viro points out that a pointer to the original one may be stored in
      ->d_fsdata.
      
      Follow this line of reasoning:
      - Add SRCU as a reverse dependency of DEBUG_FS.
      - Introduce a srcu_struct object for the debugfs subsystem.
      - In debugfs_create_file(), store a pointer to the original
        file_operations object in ->d_fsdata.
      - Make debugfs_remove() and debugfs_remove_recursive() wait for an
        SRCU grace period after the dentry has been d_delete()'d and before they
        return to their callers.
      - Introduce an intermediate file_operations object named
        "debugfs_open_proxy_file_operations". Its ->open() function checks,
        under the protection of an SRCU read lock, whether the dentry is still
        alive, i.e. has not been d_delete()'d, and if so, tries to acquire a
        reference on the owning module.
        On success, it sets the file object's ->f_op to the original
        file_operations and forwards the ongoing open() call to the original
        ->open().
      - For clarity, rename the former debugfs_file_operations to
        debugfs_noop_file_operations -- they are in no way canonical.
      
      The choice of SRCU over "normal" RCU is justified by the fact that the
      former may also be used to protect ->i_private data from going away
      during the execution of a file's readers and writers, which may (and do)
      sleep.
      
      Finally, introduce the fs/debugfs/internal.h header containing some
      declarations internal to the debugfs implementation.
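
      The proxy ->open() described above can be pictured with a small userspace
      model. This is only a sketch under stated assumptions: the types
      dentry_model, file and fops, and the helper open_via_proxy(), are
      hypothetical stand-ins and not the kernel's structures. The SRCU read lock
      and the module reference are not modelled, and -ENOENT is used here as a
      placeholder error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct file;
struct fops { int (*open)(struct file *f); };

struct dentry_model {
    bool deleted;                 /* stands in for the d_delete()'d state */
    const struct fops *real_fops; /* stands in for the pointer in ->d_fsdata */
};

struct file {
    struct dentry_model *dentry;
    const struct fops *f_op;
    int opened;                   /* set by the real ->open() for the demo */
};

/* The "original" file_operations a debugfs user would have registered. */
static int real_open(struct file *f) { f->opened = 1; return 0; }
static const struct fops real_fops_obj = { .open = real_open };

/* Proxy ->open(): check liveness, swap in the real fops, forward the call. */
static int proxy_open(struct file *f)
{
    assert(f != NULL && f->dentry != NULL);
    /* In the kernel this check would run inside an SRCU read-side section. */
    if (f->dentry->deleted)
        return -ENOENT;
    f->f_op = f->dentry->real_fops;
    return f->f_op->open ? f->f_op->open(f) : 0;
}

static const struct fops proxy_fops = { .open = proxy_open };

/* Demo helper: every open starts out with the proxy fops installed. */
static int open_via_proxy(struct dentry_model *d, struct file *f)
{
    f->dentry = d;
    f->f_op = &proxy_fops;
    f->opened = 0;
    return f->f_op->open(f);
}
```

      In this model, an open on a live dentry ends up with the real fops
      installed, while an open on a removed dentry fails without ever touching
      the real fops pointer.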
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 30 Mar, 2016 4 commits
    • fs: kernfs: Replace CURRENT_TIME by current_fs_time() · 3a3a5fec
      Deepa Dinamani committed
      This is in preparation for the series that transitions
      filesystem timestamps to use 64 bit time and hence make
      them y2038 safe.
      
      CURRENT_TIME macro will be deleted before merging the
      aforementioned series.
      
      Use current_fs_time() instead of CURRENT_TIME for inode
      timestamps.
      
      struct kernfs_node is associated with a sysfs file/directory.
      Truncate the values to appropriate time granularity when
      writing to inode timestamps of the files.
      
      ktime_get_real_ts() is used to obtain times for
      struct kernfs_iattrs. Since these times are later assigned to
      inode times using timespec_truncate() for all filesystem based
      operations, we can save the supers list traversal time here by
      using ktime_get_real_ts() directly.
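
      Truncating a timestamp to a filesystem's time granularity can be sketched
      in plain C as follows. This is an illustration only: the function name
      truncate_to_granularity() is hypothetical, and the code is a userspace
      approximation of the idea, not the kernel's timespec_truncate().

```c
#include <assert.h>
#include <time.h>

/*
 * Round ts down to a multiple of gran_ns nanoseconds.
 * gran_ns is assumed to lie between 1 (full precision) and
 * 1000000000 (whole seconds).
 */
static struct timespec truncate_to_granularity(struct timespec ts, long gran_ns)
{
    assert(gran_ns >= 1 && gran_ns <= 1000000000L);
    if (gran_ns == 1000000000L)
        ts.tv_nsec = 0;                       /* second granularity */
    else
        ts.tv_nsec -= ts.tv_nsec % gran_ns;   /* drop the sub-granule part */
    return ts;
}
```

      With a granularity of 1 ns the value is returned unchanged, which is why
      deferring the truncation until the value reaches an inode loses nothing.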
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs: debugfs: Replace CURRENT_TIME by current_fs_time() · 1b48b530
      Deepa Dinamani committed
      CURRENT_TIME macro is not appropriate for filesystems as it
      doesn't use the right granularity for filesystem timestamps.
      Use current_fs_time() instead.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • debugfs: fix inode i_nlink references for automount dentry · a8f324a4
      Roman Pen committed
      Directory inodes should start off with i_nlink == 2 (one extra ref
      for "." entry).  debugfs_create_automount() increases neither the
      i_nlink reference for current inode nor for parent inode.
      
      On an attempt to remove the automount dentry, the kernel complains:
      
        [   86.288070] WARNING: CPU: 1 PID: 3616 at fs/inode.c:273 drop_nlink+0x3e/0x50()
        [   86.288461] Modules linked in: debugfs_example2(O-)
        [   86.288745] CPU: 1 PID: 3616 Comm: rmmod Tainted: G           O    4.4.0-rc3-next-20151207+ #135
        [   86.289197] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150617_082717-anatol 04/01/2014
        [   86.289696]  ffffffff81be05c9 ffff8800b9e6fda0 ffffffff81352e2c 0000000000000000
        [   86.290110]  ffff8800b9e6fdd8 ffffffff81065142 ffff8801399175e8 ffff8800bb78b240
        [   86.290507]  ffff8801399175e8 ffff8800b73d7898 ffff8800b73d7840 ffff8800b9e6fde8
        [   86.290933] Call Trace:
        [   86.291080]  [<ffffffff81352e2c>] dump_stack+0x4e/0x82
        [   86.291340]  [<ffffffff81065142>] warn_slowpath_common+0x82/0xc0
        [   86.291640]  [<ffffffff8106523a>] warn_slowpath_null+0x1a/0x20
        [   86.291932]  [<ffffffff811ae62e>] drop_nlink+0x3e/0x50
        [   86.292208]  [<ffffffff811ba35b>] simple_unlink+0x4b/0x60
        [   86.292481]  [<ffffffff811ba3a7>] simple_rmdir+0x37/0x50
        [   86.292748]  [<ffffffff812d9808>] __debugfs_remove.part.16+0xa8/0xd0
        [   86.293082]  [<ffffffff812d9a0b>] debugfs_remove_recursive+0xdb/0x1c0
        [   86.293406]  [<ffffffffa00004dd>] cleanup_module+0x2d/0x3b [debugfs_example2]
        [   86.293762]  [<ffffffff810d959b>] SyS_delete_module+0x16b/0x220
        [   86.294077]  [<ffffffff818ef857>] entry_SYSCALL_64_fastpath+0x12/0x6a
        [   86.294405] ---[ end trace c9fc53353fe14a36 ]---
        [   86.294639] ------------[ cut here ]------------
      
      To reproduce the issue it is enough to invoke these lines:
      
           autom = debugfs_create_automount("automount", NULL, vfsmount_cb, data);
           BUG_ON(IS_ERR_OR_NULL(autom));
           debugfs_remove(autom);
      
      The issue is fixed by increasing inode i_nlink references for current
      and parent inodes.
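
      The link-count bookkeeping behind this fix can be modelled in a few lines.
      The sketch below is a hypothetical userspace model (inode_model,
      link_new_dir() and unlink_dir() are illustrative names, not kernel APIs)
      of the invariant that creation and removal must adjust the counts
      symmetrically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the i_nlink field of an inode. */
struct inode_model { unsigned int nlink; };

/* Creating a directory: it starts at 2, and the parent gains one link. */
static void link_new_dir(struct inode_model *parent, struct inode_model *dir)
{
    assert(parent != NULL && dir != NULL);
    dir->nlink = 2;     /* the entry in the parent plus "." */
    parent->nlink++;    /* ".." of the new directory */
}

/* Removing it again must undo exactly what creation did. */
static void unlink_dir(struct inode_model *parent, struct inode_model *dir)
{
    assert(parent != NULL && dir != NULL);
    assert(dir->nlink >= 2 && parent->nlink > 0); /* would underflow otherwise */
    dir->nlink -= 2;
    parent->nlink--;
}
```

      If creation skips the increments, the assertion in unlink_dir() fires,
      which is the model's counterpart of the drop_nlink() warning above.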
      Signed-off-by: Roman Pen <r.peniaev@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • chrdev: emit a warning when we go below dynamic major range · 49db08c3
      Linus Walleij committed
      Currently a dynamically allocated character device major is taken
      from 254 and downward. This mechanism is used for RTC, IIO and a
      few other subsystems.
      
      The kernel currently has no check preventing these dynamic
      allocations from eating into the assigned numbers at 233 and
      downward.
      
      In a recent test it was reported that so many dynamic device
      majors were used on a test server, that the major number for
      infiniband (231) was stolen. This occurred when allocating a new
      major number for GPIO chips. The error messages from the kernel
      were not helpful. (See: https://lkml.org/lkml/2016/2/14/124)
      
      This patch adds a defined lower limit of the dynamic major
      allocation region and will henceforth emit a warning if we start to
      eat into the assigned numbers. It does not make any semantic
      changes and will not change the kernel's behaviour: numbers will
      still continue to be stolen, but we will know from dmesg what
      is going on.
      
      This also updates the Documentation/devices.txt to clearly
      reflect that we are using this range of major numbers for dynamic
      allocation.
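
      The behaviour described above (allocate downward from 254, keep going
      below the limit, but warn) can be sketched as a userspace model. The
      names DYN_START, DYN_END, major_in_use and alloc_dynamic_major() are
      hypothetical, and the limit of 234 is an assumption derived from the
      statement that 233 and downward are assigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define DYN_START 254   /* dynamic majors are handed out from here downward */
#define DYN_END   234   /* assumption: majors 233 and below are assigned */

static bool major_in_use[256];

/*
 * Return the highest free major at or below DYN_START, or -1 if none is
 * left. *warned is set when the result falls below the dynamic range;
 * the allocation still succeeds, only a warning is printed.
 */
static int alloc_dynamic_major(bool *warned)
{
    assert(warned != NULL);
    *warned = false;
    for (int i = DYN_START; i > 0; i--) {
        if (major_in_use[i])
            continue;
        major_in_use[i] = true;
        if (i < DYN_END) {
            *warned = true;
            fprintf(stderr, "warning: dynamic major %d is below %d\n",
                    i, DYN_END);
        }
        return i;
    }
    return -1;
}
```

      The first 21 allocations (254 down to 234) are silent; the 22nd returns
      233 and sets the warning flag.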
      Reported-by: Ying Huang <ying.huang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 27 Mar, 2016 1 commit
    • f2fs/crypto: fix xts_tweak initialization · 02fc59a0
      Linus Torvalds committed
      Commit 0b81d077 ("fs crypto: move per-file encryption from f2fs
      tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
      renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
      to just "FS" for preprocessor symbols).
      
      Because of the symbol renaming, it's a bit hard to see it as a file
      move: use
      
          git show -M30 0b81d077
      
      to lower the rename detection to just 30% similarity and make git show
      the files as renamed (the header file won't be shown as a rename even
      then - since all it contains is symbol definitions, it looks almost
      completely different).
      
      Even with the renames showing as renames, the diffs are not all that
      easy to read, since so much is just the renames.  But Eric Biggers
      noticed that it's not just all renames: the initialization of the
      xts_tweak had been broken too, using the inode number rather than the
      page offset.
      
      That's not right - it makes the xts_tweak the same for all pages of each
      inode.  It _might_ make sense to make the xts_tweak contain both the
      offset _and_ the inode number, but not just the inode number.
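
      A per-page tweak derived from the page offset can be sketched as below.
      This is an illustration, not the fs/crypto code: make_tweak() and
      TWEAK_SIZE are hypothetical names, the 16-byte size is assumed to match
      the AES block size used by XTS, and the page index is copied in host
      byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TWEAK_SIZE 16   /* assumption: one AES block, as XTS uses */

/* Build a tweak from the page index, zero-padding the remaining bytes. */
static void make_tweak(uint8_t tweak[TWEAK_SIZE], uint64_t page_index)
{
    assert(tweak != NULL);
    memcpy(tweak, &page_index, sizeof(page_index));
    memset(tweak + sizeof(page_index), 0, TWEAK_SIZE - sizeof(page_index));
}
```

      Two pages of the same file get different tweaks here, whereas a tweak
      built only from the inode number would be identical for both.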
      Reported-by: Eric Biggers <ebiggers3@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 26 Mar, 2016 34 commits
    • orangefs: fix orangefs_superblock locking · 45996492
      Al Viro committed
      * switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb
      * remove from the list _before_ orangefs_unmount() - request_mutex
      in the latter will make sure that nothing observed in the loop in
      ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end
      of loop
      * on removal, keep the forward pointer and zero the back one.  That
      way we can drop and regain the spinlock in the loop body (again,
      ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the
      rest of the list.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
    • orangefs: fix do_readv_writev() handling of error halfway through · 6d4c1a30
      Al Viro committed
      Error should only be returned if nothing had been read/written.
      Otherwise we need to report a short read/write instead.
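
      The rule "report an error only if nothing was transferred" can be
      sketched as a generic chunked I/O loop. The names do_chunks(), chunk_fn
      and fail_after() are hypothetical and only illustrate the return-value
      convention; they are not taken from the orangefs code.

```c
#include <assert.h>
#include <stddef.h>

/* Transfers up to 'want' bytes; returns bytes moved or a negative error. */
typedef long (*chunk_fn)(void *ctx, size_t want);

/*
 * Move 'total' bytes in pieces of at most 'chunk' bytes.
 * If an error occurs after some progress, return the short count;
 * return the error only when nothing at all was transferred.
 */
static long do_chunks(chunk_fn fn, void *ctx, size_t total, size_t chunk)
{
    size_t done = 0;
    long err = 0;

    assert(fn != NULL && chunk > 0);
    while (done < total) {
        size_t want = total - done < chunk ? total - done : chunk;
        long ret = fn(ctx, want);

        if (ret < 0) {
            err = ret;
            break;
        }
        if (ret == 0)
            break;
        done += (size_t)ret;
    }
    return done ? (long)done : err;
}

/* Demo callback: succeeds *ctx times, then fails with -5 (think -EIO). */
static long fail_after(void *ctx, size_t want)
{
    int *left = ctx;

    if (*left <= 0)
        return -5;
    (*left)--;
    return (long)want;
}
```

      A caller that sees a short count can retry from the new offset and will
      then observe the error on the next call, if it persists.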
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
    • 524b1d30
    • orangefs: sanitize ->llseek() · 177f8fc4
      Al Viro committed
      a) open files can't have NULL inodes
      b) it's SEEK_END, not ORANGEFS_SEEK_END; no need to get cute.
      c) make_bad_inode() on lseek()?
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
    • orangefs-bufmap.h: trim unused junk · 7df240d7
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
    • orangefs: saner calling conventions for getting a slot · b8a99a8f
      Al Viro committed
      just have it return the slot number or -E... - the caller checks
      the sign anyway
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
    • orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer · bf6bf606
      Al Viro committed
      it's always __orangefs_bufmap
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
    • orangefs: get rid of readdir_handle_s · 9f5e2f7f
      Al Viro committed
      no point, really - we couldn't keep those across the calls of
      getdents(); it would be too easy to DoS, having all slots exhausted.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
    • ocfs2: extend enough credits for freeing one truncate record while replaying truncate records · 102c2595
      Xue jiufei committed
      Now function ocfs2_replay_truncate_records() first modifies tl_used,
      then calls ocfs2_extend_trans() to extend transactions for gd and alloc
      inode used for freeing clusters.  jbd2_journal_restart() may be called
      and it may happen that tl_used in truncate log is decreased but the
      clusters are not freed, which means these clusters are lost.  So we
      should avoid extending transactions in these two operations.
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Acked-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et · 17215989
      Xue jiufei committed
      I found that jbd2_journal_restart() is called in some places without
      first making things consistent.  However, jbd2_journal_restart() may
      commit the handle's transaction and restart another one.  If the first
      transaction is committed successfully while the other is not, it may
      cause filesystem inconsistency or a read-only filesystem.  This is an
      effort to fix this kind of problem.
      
      This patch (of 3):
      
      The following functions will be called while truncating an extent:
      ocfs2_remove_btree_range
        -> ocfs2_start_trans
        -> ocfs2_remove_extent
           -> ocfs2_truncate_rec
             -> ocfs2_extend_rotate_transaction
               -> jbd2_journal_restart if jbd2_journal_extend fail
             -> ocfs2_rotate_tree_left
               -> ocfs2_remove_rightmost_path
                   -> ocfs2_extend_rotate_transaction
                     -> ocfs2_unlink_subtree
                      -> ocfs2_update_edge_lengths
                        -> ocfs2_extend_trans
                          -> jbd2_journal_restart if jbd2_journal_extend fail
        -> ocfs2_et_update_clusters
        -> ocfs2_commit_trans
      
      jbd2_journal_restart() may be called, and it may happen that the buffers
      dirtied in ocfs2_truncate_rec() are committed while the buffers dirtied in
      ocfs2_et_update_clusters() are not, so the total clusters on the extent
      tree and i_clusters in ocfs2_dinode become inconsistent.  The cluster
      count read from ocfs2_dinode is then incorrect, and it also causes a
      read-only problem when calling ocfs2_commit_truncate(), with the error
      message: "Inode %llu has empty extent block at %llu".
      
      We should extend enough credits for function ocfs2_remove_rightmost_path
      and ocfs2_update_edge_lengths to avoid this inconsistency.
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Acked-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert · e5054c9a
      xuejiufei committed
      We have found a bug when two nodes umount one after another.

      1) Node 1 migrates a lockres that has 3 locks in the grant queue, such as
         N2(PR)<->N3(NL)<->N4(PR), to N2.  After migration, the lvbs of the locks
         N3(NL) and N4(PR) are empty on node 2 because the migration target does
         not copy the lvb to these two locks.

      2) Node 3 wants to convert to PR; it can be granted in
         __dlmconvert_master(), and the order of these locks is unchanged.  The
         lvb of the lock N3(PR) on node 2 is copied from the lockres in function
         dlm_update_lvb() while the lvb of lock N4(PR) is still empty.

      3) Node 2 wants to leave the domain, so it migrates this lockres to node 3.
         Then node 2 triggers the BUG in dlm_prepare_lvb_for_migration()
         when adding the lock N4(PR) to mres, with the following message, because
         the lvb of mres is already copied from lock N3(PR), but the lvb of lock
         N4(PR) is empty.
      
      "Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: xuejiufei <xuejiufei@huawei.com>
      Acked-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: solve a problem of crossing the boundary in updating backups · 584dca34
      jiangyiwen committed
      In update_backups() there is a boundary-crossing problem, as
      follows:

      Assume that the LUN will be resized to 1TB (cluster_size is 32KB), so it
      includes clusters 0~33554431. In update_backups(), the super block is
      backed up at the 1TB location, which is cluster 33554432, so the
      write crosses the boundary of the volume.
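
      The arithmetic can be checked with a short sketch. The helper names
      byte_to_cluster() and backup_in_volume() are hypothetical and only
      restate the numbers given above: 1TB divided by 32KB clusters gives
      33554432 clusters, numbered 0 to 33554431, so an offset of exactly 1TB
      maps to the first cluster past the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cluster number that contains byte offset 'off'. */
static uint64_t byte_to_cluster(uint64_t off, uint32_t cluster_size)
{
    assert(cluster_size != 0);
    return off / cluster_size;
}

/* A backup location is usable only if its cluster lies inside the volume. */
static bool backup_in_volume(uint64_t off, uint32_t cluster_size,
                             uint64_t total_clusters)
{
    return byte_to_cluster(off, cluster_size) < total_clusters;
}
```

      For the case in the commit message, backup_in_volume() is false for the
      1TB offset on a 1TB volume, so a bounds check of this kind rejects it.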
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Xue jiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local · 35ddf78e
      jiangyiwen committed
      This patch fixes a deadlock, as follows:
      
        Node 1                Node 2                  Node 3
      1)volume a and b are    only mount vol a        only mount vol b
        mounted
      
      2)                      start to mount b        start to mount a
      
      3)                      check hb of Node 3      check hb of Node 2
                              in vol a, qs_holds++    in vol b, qs_holds++
      
      4) -------------------- all nodes' network down --------------------
      
      5)                      progress of mount b     the same situation as
                              failed, and then call   Node 2
                              ocfs2_dismount_volume.
                              but the process is hung,
                              since there is a work
                              in ocfs2_wq cannot be
                              completed. This work is
                              about vol a, because
                              ocfs2_wq is global wq.
                              BTW, this work which is
                              scheduled in ocfs2_wq is
                              ocfs2_orphan_scan_work,
                              and the context in this work
                              needs to take inode lock
                              of orphan_dir, because
                              lockres owner are Node 1 and
                              all nodes' network has been down
                              at the same time, so it can't
                              get the inode lock.
      
      6)                      Why can't this node be fenced
                              when network disconnected?
                              Because the process of
                              mount is hung what caused qs_holds
                              is not equal 0.
      
      All work items in ocfs2_wq relate to a super block.
      
      The solution is to change the ocfs2_wq from global to local.  In other
      words, move it into struct ocfs2_super.
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Xue jiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list · be12b299
      Joseph Qi committed
      When the master handles a convert request, it queues the ast first and
      then returns the status.  It may happen that the ast is sent before the
      request status because the two messages are sent by two threads.  If
      the master goes down right after the ast is sent, it may trigger the BUG
      in dlm_move_lockres_to_recovery_list on the requesting node, because the
      ast handler moves the lock to the grant list without clearing
      lock->convert_pending.  So remove the BUG_ON statement and check whether
      the ast has been processed in dlmconvert_remote.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: fix race between convert and recovery · ac7cf246
      Joseph Qi committed
      There is a race window between dlmconvert_remote and
      dlm_move_lockres_to_recovery_list, which leaves a lock with
      OCFS2_LOCK_BUSY in the grant list, so the system hangs.
      
      dlmconvert_remote
      {
              spin_lock(&res->spinlock);
              list_move_tail(&lock->list, &res->converting);
              lock->convert_pending = 1;
              spin_unlock(&res->spinlock);
      
              status = dlm_send_remote_convert_request();
              >>>>>> race window, master has queued ast and return DLM_NORMAL,
                     and then down before sending ast.
                     this node detects master down and calls
                     dlm_move_lockres_to_recovery_list, which will revert the
                     lock to grant list.
                     Then OCFS2_LOCK_BUSY won't be cleared as new master won't
                     send ast any more because it thinks already be authorized.
      
              spin_lock(&res->spinlock);
              lock->convert_pending = 0;
              if (status != DLM_NORMAL)
                      dlm_revert_pending_convert(res, lock);
              spin_unlock(&res->spinlock);
      }
      
      In this case, check whether res->state has the DLM_LOCK_RES_RECOVERING bit
      set (res is still recovering) or the res master has changed (the new
      master has finished recovery); if so, reset the status to DLM_RECOVERING
      so that the convert is retried.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write() · 28888681
      Ryan Ding committed
      The code should call ocfs2_free_alloc_context() to free meta_ac &
      data_ac before calling ocfs2_run_deallocs(), because
      ocfs2_run_deallocs() will acquire the system inode's i_mutex held by
      meta_ac.  So release the lock before ocfs2_run_deallocs().
      
      Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Acked-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix disk file size and memory file size mismatch · ce170828
      Ryan Ding committed
      When doing append direct write in an already allocated cluster, and fast
      path in ocfs2_dio_get_block() is triggered, function
      ocfs2_dio_end_io_write() will be skipped as there is no context
      allocated.
      
      As a result, the disk file size will not be changed as it should be.
      The solution is to skip fast path when we are about to change file size.
      
      Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Acked-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_write · a86a72a4
      Ryan Ding committed
      Take ip_alloc_sem to prevent concurrent access to the extent tree, which
      may leave the extent tree in an unstable state.
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix ip_unaligned_aio deadlock with dio work queue · e63890f3
      Ryan Ding committed
      In the current implementation of unaligned aio+dio, the lock order is as
      follows:
      
      in user process context:
        -> call io_submit()
          -> get i_mutex
      		<== window1
            -> get ip_unaligned_aio
              -> submit direct io to block device
          -> release i_mutex
        -> io_submit() return
      
      in dio work queue context(the work queue is created in __blockdev_direct_IO):
        -> release ip_unaligned_aio
      		<== window2
          -> get i_mutex
            -> clear unwritten flag & change i_size
          -> release i_mutex
      
      The number of dio work queue threads is limited, 256 by
      default.  If all 256 threads are in the above 'window2' stage, and there
      is a user process in the 'window1' stage, the system will
      deadlock: the user process holds i_mutex while waiting for the
      ip_unaligned_aio lock, a direct bio holds the ip_unaligned_aio mutex while
      waiting for a dio work queue thread to be scheduled, and all the dio
      work queue threads are waiting for i_mutex in 'window2'.

      This case only happened in a test which sends a large number (more than
      256) of aios in one io_submit() call.

      My design is to remove the ip_unaligned_aio lock and change unaligned
      aio+dio to sync io instead.  Just like the ip_unaligned_aio lock, this
      serializes unaligned aio dio.
      
      [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: code clean up for direct io · f1f973ff
      Ryan Ding committed
      Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
       * remove append dio check: it will be checked in ocfs2_direct_IO()
       * remove file hole check: file hole is supported for now
       * remove inline data check: it will be checked in ocfs2_direct_IO()
       * remove the full_coherence check when append dio: we will get the
         inode_lock in ocfs2_dio_get_block, there is no need to fall back to
         buffer io to ensure the coherence semantics.
      
      Now the drop dio procedure is gone.  :)
      
      [akpm@linux-foundation.org: remove unused label]
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix sparse file & data ordering issue in direct io · c15471f7
      Ryan Ding committed
      There are mainly three issues in the direct io code path after commit
      24c40b32 ("ocfs2: implement ocfs2_direct_IO_write"):
      
        * Sparse files are not supported.
        * Data ordering is not supported.  E.g. when writing to a file hole, it
          allocates the extent first.  If the system crashes before the io
          finishes, data will be corrupted.
        * Potential risk when doing aio+dio.  The -EIOCBQUEUED return value is
          likely to be ignored by ocfs2_direct_IO_write().

      To resolve the above problems, re-design the direct io code with the
      following ideas:
        * Use buffer io to fill in holes.  This also gives better
          performance.
        * Clear unwritten after the direct write has finished, so we can make sure
          meta data changes only after data is written to disk.  (An unwritten
          extent is invisible to the user; from the user's view, meta data is not
          changed when an unwritten extent is allocated.)
        * Clear ocfs2_direct_IO_write().  Do all ending work in end_io.
      
      This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4
      test cases of ltp.
      
      For performance improvement, see following test result:
      ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/.
      The original way:
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
        1048576+0 records in
        1048576+0 records out
        4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
        16384+0 records in
        16384+0 records out
        4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s
      
      After this patch:
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
        1048576+0 records in
        1048576+0 records out
        4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
        16384+0 records in
        16384+0 records out
        4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
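The dd figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch in C; the helper names are illustrative only and are not part of ocfs2, and dd is assumed to report decimal megabytes (10^6 bytes) per second.

```c
#include <assert.h>

/* Throughput in decimal MB/s (10^6 bytes per second), the unit assumed for dd. */
double throughput_mb_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e6;
}

/* How many times faster the second run is when both copy the same data. */
double speedup(double seconds_before, double seconds_after)
{
    assert(seconds_after > 0.0);
    return seconds_before / seconds_after;
}
```

With the numbers from the test runs, throughput_mb_s(4294967296.0, 64.6412) comes out at about 66.4, matching the dd report, and speedup(1707.83, 64.6412) is roughly 26 for bs=4K. For bs=256K, speedup(582.705, 34.7611) is roughly 17.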
    • R
      ocfs2: record UNWRITTEN extents when populate write desc · 4506cfb6
      Ryan Ding committed
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      There is still one issue in the direct write procedure.
      
      phase 1: alloc extent with UNWRITTEN flag
      phase 2: submit direct data to disk, add zero page to page cache
      phase 3: clear UNWRITTEN flag when data has been written to disk
      
      Suppose two direct writes, A (0~3KB) and B (4~7KB), write to the same
      cluster 0~7KB (cluster size 8KB).  Write request A arrives at phase 2
      first and zeroes the region (4~7KB).  Before request A enters phase 3,
      request B arrives at phase 2 and zeroes the region (0~3KB).  In effect,
      request B steps on request A.
      
      To resolve this issue, request B must know that this cluster is already
      being zeroed, so that it does not step on the previous write request.
      
      This patch adds the function ocfs2_unwritten_check() to do this job.  It
      records all clusters that are under direct write (in the
      'ip_unwritten_list' member of the inode info) and prevents a later
      direct write to the same cluster from doing the zero work again.
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
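The bookkeeping described in this commit message could look roughly like the sketch below. It is a single-threaded userspace illustration in C, not the ocfs2 code: the struct and function names are hypothetical, and a real implementation would also need locking around the list.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one entry on the inode's 'ip_unwritten_list'. */
struct unwritten_entry {
    uint32_t cpos;                  /* cluster currently under direct write */
    struct unwritten_entry *next;
};

/* Hypothetical, much reduced inode info: only the list head. */
struct inode_sketch {
    struct unwritten_entry *unwritten_list;
};

/*
 * Returns 1 if the caller is the first direct write on this cluster and
 * must do the zero work, 0 if an earlier write already recorded it, and
 * -1 if the entry could not be allocated.
 */
int unwritten_check_sketch(struct inode_sketch *inode, uint32_t cpos)
{
    struct unwritten_entry *e;

    assert(inode != NULL);
    for (e = inode->unwritten_list; e != NULL; e = e->next)
        if (e->cpos == cpos)
            return 0;               /* already being zeroed: skip */

    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->cpos = cpos;
    e->next = inode->unwritten_list;
    inode->unwritten_list = e;
    return 1;
}

/* Drop the record once the UNWRITTEN flag has been cleared (phase 3). */
void unwritten_done_sketch(struct inode_sketch *inode, uint32_t cpos)
{
    struct unwritten_entry **pp;

    assert(inode != NULL);
    pp = &inode->unwritten_list;
    while (*pp != NULL) {
        if ((*pp)->cpos == cpos) {
            struct unwritten_entry *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}
```

In the A/B scenario above, request A would get 1 and zero the rest of the cluster, while request B would get 0 and leave A's region alone.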
    • R
      ocfs2: return the physical address in ocfs2_write_cluster · 2de6a3c7
      Ryan Ding committed
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      Direct io needs to get the physical address from write_begin in order to
      map the user page.  This patch changes the 'phys' argument of
      ocfs2_write_cluster to a pointer, so that the address can be returned to
      write_begin and, from there, to the direct io procedure.
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
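The change from a by-value argument to a pointer is the usual C out-parameter pattern. The sketch below only illustrates that pattern; the function names, the signatures and the "0 means not yet allocated" convention are assumptions, not the real ocfs2 interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical cluster writer.  If *phys is 0 the cluster has no physical
 * location yet, so "allocate" one and report it through the pointer.
 */
int write_cluster_sketch(uint64_t *phys, uint64_t next_free_block)
{
    assert(phys != NULL);
    if (*phys == 0)
        *phys = next_free_block;    /* caller can now see the new address */
    return 0;
}

/*
 * Hypothetical write_begin: hands the address on to its own caller, which
 * for direct io would use it to map the user page.
 */
int write_begin_phys_sketch(uint64_t mapped_phys, uint64_t next_free_block,
                            uint64_t *phys_out)
{
    uint64_t phys = mapped_phys;
    int ret = write_cluster_sketch(&phys, next_free_block);

    if (ret == 0 && phys_out != NULL)
        *phys_out = phys;
    return ret;
}
```

Had 'phys' stayed a plain value, an address chosen inside the cluster writer would be lost when the function returned.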
    • R
      ocfs2: do not change i_size in write_end for direct io · 46e62556
      Ryan Ding committed
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      Append direct io does not change i_size in the get block phase.  It only
      moves to orphan when starting the write.  After the data is written to
      disk, it deletes itself from orphan and updates i_size.  So skip the
      i_size change section in write_begin for direct io.
      
      And when no extents are allocated, direct io needs no metadata changes
      (write_begin starts a transaction for two reasons: allocating extents
      and changing i_size; now neither is needed).  So we can skip the
      transaction start procedure.
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      ocfs2: test target page before change it · 65c4db8c
      Ryan Ding committed
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      Direct io data will not appear in the buffer.  The w_target_page member
      will not be filled by direct io, so avoid using it when it is NULL.
      Unlike buffer io and mmap, direct io calls write_begin with more than
      one page at a time, so target_index is not sufficient to describe the
      actual data.  Change it to a range that starts at target_index and ends
      at end_index.
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
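To see why a single target_index is not enough for a multi-page write, here is a small sketch in C of turning a byte range into a page index range. The 4KB page size, the inclusive end_index and all names are assumptions made for illustration; the patch itself may compute the range differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12        /* assumed 4KB pages */

/* Hypothetical page range covered by one write; end_index is inclusive. */
struct page_range_sketch {
    unsigned long target_index;
    unsigned long end_index;
};

/* Compute the first and last page touched by a write of len bytes at pos. */
struct page_range_sketch write_page_range_sketch(uint64_t pos, size_t len)
{
    struct page_range_sketch r;

    assert(len > 0);
    r.target_index = (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
    r.end_index = (unsigned long)((pos + len - 1) >> SKETCH_PAGE_SHIFT);
    return r;
}

/* True if the page at 'index' holds data belonging to the write. */
bool page_in_write_range_sketch(struct page_range_sketch r, unsigned long index)
{
    return index >= r.target_index && index <= r.end_index;
}
```

A buffered write of 10 bytes at offset 100 stays within page 0, whereas an 8KB direct write at offset 4096 covers pages 1 and 2.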
    • R
      ocfs2: use c_new to indicate newly allocated extents · b46637d5
      Ryan Ding committed
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      There is a problem in ocfs2's direct io implementation: if the system
      crashes after extents are allocated and before the data write returns,
      we will get an extent with dirty data on disk.  This problem violates
      the data=ordered journaling semantics, which mean that metadata changes
      take effect only after the data has been written to disk.  To resolve
      this issue, direct write can use the UNWRITTEN flag to describe an
      extent during direct data writeback.  The direct write procedure should
      act in the following order:
      
      phase 1: alloc extent with UNWRITTEN flag
      phase 2: submit direct data to disk, add zero page to page cache
      phase 3: clear UNWRITTEN flag when data has been written to disk
      
      This patch changes the 'c_unwritten' member of
      ocfs2_write_cluster_desc to 'c_clear_unwritten', which says whether to
      clear the unwritten flag, regardless of whether the extent is newly
      allocated.  It uses 'c_new' to mark a newly allocated extent.  The
      direct io procedure can then use c_clear_unwritten to control the
      UNWRITTEN bit on the extent.
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
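One way to picture why 'c_new' and 'c_clear_unwritten' are kept as separate flags is the toy model below. It is a sketch in C under stated assumptions: the structs are drastically simplified, the function names are hypothetical, and the way the flags are set here is illustrative rather than a copy of the ocfs2 logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified on-disk extent state. */
struct extent_sketch {
    bool allocated;
    bool unwritten;                 /* the UNWRITTEN flag */
};

/* Hypothetical, simplified write cluster descriptor. */
struct cluster_desc_sketch {
    bool c_new;                     /* extent was allocated by this write */
    bool c_clear_unwritten;         /* UNWRITTEN must be cleared afterwards */
};

/* Phase 1: allocate if needed, always with the UNWRITTEN flag set. */
void phase1_alloc_sketch(struct extent_sketch *ext,
                         struct cluster_desc_sketch *desc)
{
    assert(ext != NULL && desc != NULL);
    desc->c_new = !ext->allocated;
    if (desc->c_new) {
        ext->allocated = true;
        ext->unwritten = true;
    }
    /* Independent of c_new: a preallocated extent may be unwritten too. */
    desc->c_clear_unwritten = ext->unwritten;
}

/* Phase 3: runs only after the data has reached the disk. */
void phase3_io_done_sketch(struct extent_sketch *ext,
                           const struct cluster_desc_sketch *desc)
{
    assert(ext != NULL && desc != NULL);
    if (desc->c_clear_unwritten)
        ext->unwritten = false;
}

/* A reader sees the data only once the extent is no longer unwritten. */
bool data_visible_sketch(const struct extent_sketch *ext)
{
    return ext->allocated && !ext->unwritten;
}
```

If the system crashes between phase 1 and phase 3 in this model, the extent is still marked unwritten, so stale disk contents are never exposed as file data.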
    • R
      ocfs2: add ocfs2_write_type_t type to identify the caller of write · c1ad1e3c
      Ryan Ding committed
      Patchset: fix ocfs2 direct io code path to support sparse files and data
      ordering semantics
      
      The idea is to use buffer io (more precisely, the interface
      ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work
      beyond block size, and to leave the UNWRITTEN flag set until the direct
      io data has been written to disk, which prevents data corruption when
      the system crashes during a direct write.
      
      We also achieve better performance.  E.g. dd direct write of a new
      file with block size 4KB, before this patchset:
        2.5 MB/s
      after this patchset:
        66.4 MB/s
      
      This patch (of 8):
      
      To support direct io in ocfs2_write_begin_nolock &
      ocfs2_write_end_nolock.
      
      Remove the unused args filp & flags.  Add a new arg, type, which is one
      of buffer/direct/mmap and indicates the three ways to perform a write.
      The buffer and mmap types are already implemented; the direct type will
      be implemented later.
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
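A write type argument of this kind is typically an enum that the write path switches on. The sketch below shows the idea in C; the enum, its constants and the function are hypothetical names chosen for illustration, and returning -ENOTSUP for the not-yet-implemented direct type is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical write type, standing in for the new 'type' argument. */
enum write_type_sketch {
    WRITE_BUFFER_SKETCH,
    WRITE_DIRECT_SKETCH,
    WRITE_MMAP_SKETCH
};

/*
 * Hypothetical dispatcher: buffer and mmap writes are handled, the direct
 * type is reserved for a later patch in the series.
 */
int write_begin_by_type_sketch(enum write_type_sketch type)
{
    assert(type <= WRITE_MMAP_SKETCH);
    switch (type) {
    case WRITE_BUFFER_SKETCH:
    case WRITE_MMAP_SKETCH:
        return 0;
    case WRITE_DIRECT_SKETCH:
        return -ENOTSUP;
    }
    return -EINVAL;                 /* unknown type */
}
```

Passing an explicit type lets the shared write_begin/write_end code branch per caller instead of inferring the caller from unrelated arguments such as filp or flags.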
    • J
      ocfs2: o2hb: fix double free bug · 9e13f1f9
      Junxiao Bi committed
      This is a regression that caused the following kernel panic when running
      the ocfs2 multiple test.
      
        BUG: unable to handle kernel paging request at 00000002000800c0
        IP: [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160
        PGD 7bbe5067 PUD 0
        Oops: 0000 [#1] SMP
        Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
        CPU: 2 PID: 4044 Comm: mpirun Not tainted 4.5.0-rc5-next-20160225 #1
        Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
        task: ffff88007a521a80 ti: ffff88007aed0000 task.ti: ffff88007aed0000
        RIP: 0010:[<ffffffff81192978>]  [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160
        RSP: 0018:ffff88007aed3a48  EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001991
        RDX: 0000000000001990 RSI: 00000000024000c0 RDI: 000000000001b330
        RBP: ffff88007aed3a98 R08: ffff88007d29b330 R09: 00000002000800c0
        R10: 0000000c51376d87 R11: ffff8800792cac38 R12: ffff88007cc30f00
        R13: 00000000024000c0 R14: ffffffff811b053f R15: ffff88007aed3ce7
        FS:  0000000000000000(0000) GS:ffff88007d280000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000002000800c0 CR3: 000000007aeb2000 CR4: 00000000000406e0
        Call Trace:
          __d_alloc+0x2f/0x1a0
          d_alloc+0x17/0x80
          lookup_dcache+0x8a/0xc0
          path_openat+0x3c3/0x1210
          do_filp_open+0x80/0xe0
          do_sys_open+0x110/0x200
          SyS_open+0x19/0x20
          do_syscall_64+0x72/0x230
          entry_SYSCALL64_slow_path+0x25/0x25
        Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84 dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63
        RIP   kmem_cache_alloc+0x78/0x160
        CR2: 00000002000800c0
        ---[ end trace 823969e602e4aaac ]---
      
      Fixes: a4a1dfa4 ("ocfs2/cluster: fix memory leak in o2hb_region_release")
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      ceph: use kmem_cache_zalloc · 99ec2697
      Geliang Tang committed
      Use kmem_cache_zalloc() instead of kmem_cache_alloc() with the
      __GFP_ZERO flag.
      Signed-off-by: Geliang Tang <geliangtang@163.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
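The relationship between the two calls can be mimicked in userspace. In the kernel, kmem_cache_zalloc() is a thin wrapper that calls kmem_cache_alloc() with __GFP_ZERO added to the flags; the sketch below reproduces only that shape with malloc(), and every name in it is a hypothetical stand-in.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_ZERO 0x1u            /* hypothetical analogue of __GFP_ZERO */

/* Stand-in for an allocator that zeroes memory only when asked to. */
void *cache_alloc_sketch(size_t size, unsigned int flags)
{
    void *obj;

    assert(size > 0);
    obj = malloc(size);
    if (obj != NULL && (flags & SKETCH_ZERO))
        memset(obj, 0, size);
    return obj;
}

/* Stand-in for the zalloc wrapper: same call with the zero flag forced on. */
void *cache_zalloc_sketch(size_t size, unsigned int flags)
{
    return cache_alloc_sketch(size, flags | SKETCH_ZERO);
}
```

The wrapper changes nothing functionally; it only makes the intent to get zeroed memory visible at the call site.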
    • Y
      ceph: use lookup request to revalidate dentry · 200fd27c
      Yan, Zheng committed
      If a dentry has no lease, ceph_d_revalidate() previously returned 0.
      This causes the VFS to invalidate the dentry and create a new dentry
      for a later lookup.  Invalidating a dentry also detaches any mount
      points underneath it, so a mount point inside cephfs can disappear
      mysteriously (even if the mount point is not modified by other hosts).
      
      The fix is to use a lookup request to revalidate a dentry without a
      lease.  This partly solves the disappearing mount point issue (as long
      as the mount point is not modified by other hosts).
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • Y
      ceph: kill ceph_get_dentry_parent_inode() · 641235d8
      Yan, Zheng committed
      Use the VFS helper dget_parent() instead.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • Y
      ceph: fix security xattr deadlock · 315f2408
      Yan, Zheng committed
      When security is enabled, the security module can call the filesystem's
      getxattr/setxattr callbacks during d_instantiate().  For cephfs,
      d_instantiate() is usually called by the MDS dispatch thread while
      handling an MDS reply.  If the MDS reply does not include xattrs and
      the corresponding caps, getxattr/setxattr needs to send a new request
      to the MDS and wait for the reply.  This makes the MDS dispatch thread
      sleep, so nobody handles later MDS replies.
      
      The fix is to make sure the lookup/atomic_open reply includes xattrs
      and the corresponding caps, so getxattr can be handled from cached
      xattrs.  This requires some modification to both the MDS and the
      request message.  (The client tells the MDS what caps it wants; the MDS
      encodes the proper caps in the reply.)
      
      The Smack security module may call setxattr during d_instantiate().
      Unlike getxattr, we can't force the MDS to issue CEPH_CAP_XATTR_EXCL
      to us.  So just make setxattr return an error when called by the MDS
      dispatch thread.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • Y
      ceph: don't request vxattrs from MDS · 29dccfa5
      Yan, Zheng committed
      It's useless because the MDS reply does not carry any vxattr.
      Signed-off-by: Yan, Zheng <zyan@redhat.com>
    • Y
      ceph: fix mounting same fs multiple times · 132ca7e1
      Yan, Zheng committed
      Now __ceph_open_session() only accepts a closed client.  An opened
      client will trigger BUG_ON().
      Signed-off-by: Yan, Zheng <zyan@redhat.com>