1. 01 Feb, 2018 2 commits
    • G
      ocfs2: add trimfs dlm lock resource · 4882abeb
      Gang He committed
      Introduce a new dlm lock resource, which will be used to communicate
      during fstrimming of an ocfs2 device from cluster nodes.
      
      Link: http://lkml.kernel.org/r/1513228484-2084-1-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE · ff26cc10
      Gang He committed
      If we can't get the inode lock immediately in
      ocfs2_inode_lock_with_page() when reading a page, we should not return
      directly, since this leads to a softlockup when the kernel is built
      without CONFIG_PREEMPT.  Instead, take a blocking lock and immediately
      unlock it before returning.  This avoids wasting CPU on repeated
      retries, improves fairness in acquiring the lock among multiple nodes,
      and increases efficiency when the same file is modified frequently
      from multiple nodes.
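
The retry behaviour described above can be sketched in userspace C. This is only a toy model under stated assumptions: `sim_lock`, `sim_trylock()`, `sim_lock_blocking()` and the return codes are hypothetical stand-ins, not the ocfs2 or DLM API, and the "blocking" wait is simulated by clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the cluster lock; not the ocfs2 API. */
enum { LOCK_OK = 0, LOCK_WOULD_BLOCK = -1, RETRY_PAGE = 1 };

struct sim_lock {
    bool held_remotely;   /* another node currently holds the lock */
    int blocking_waits;   /* how many times we waited for the lock */
};

/* Non-blocking attempt: fails at once if a remote node holds the lock. */
static int sim_trylock(struct sim_lock *l)
{
    return l->held_remotely ? LOCK_WOULD_BLOCK : LOCK_OK;
}

/* Blocking attempt: modelled as waiting until the remote holder is done. */
static void sim_lock_blocking(struct sim_lock *l)
{
    l->blocking_waits++;
    l->held_remotely = false;
}

/* Dropping the lock needs no bookkeeping in this toy model. */
static void sim_unlock(struct sim_lock *l)
{
    (void)l;
}

/*
 * Read path: on a failed trylock, wait once for the lock and drop it
 * again before asking the caller to retry, instead of returning at once
 * and spinning through repeated retries.
 */
static int lock_with_page(struct sim_lock *l)
{
    if (sim_trylock(l) == LOCK_OK)
        return LOCK_OK;
    sim_lock_blocking(l);
    sim_unlock(l);
    return RETRY_PAGE;
}
```

In this model the caller retries only after the contended lock has become available, so the retry loop sleeps once rather than spinning.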
      
      The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
      looks like:
      
        Kernel panic - not syncing: softlockup: hung tasks
        CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          <IRQ>
          dump_stack+0x5c/0x82
          panic+0xd5/0x21e
          watchdog_timer_fn+0x208/0x210
          __hrtimer_run_queues+0xcc/0x200
          hrtimer_interrupt+0xa6/0x1f0
          smp_apic_timer_interrupt+0x34/0x50
          apic_timer_interrupt+0x96/0xa0
          </IRQ>
         RIP: 0010:unlock_page+0x17/0x30
         RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
         RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
         RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
         RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
         R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
         R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
          ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
          ocfs2_readpage+0x41/0x2d0 [ocfs2]
          filemap_fault+0x12b/0x5c0
          ocfs2_fault+0x29/0xb0 [ocfs2]
          __do_fault+0x1a/0xa0
          __handle_mm_fault+0xbe8/0x1090
          handle_mm_fault+0xaa/0x1f0
          __do_page_fault+0x235/0x4b0
          trace_do_page_fault+0x3c/0x110
          async_page_fault+0x28/0x30
         RIP: 0033:0x7fa75ded638e
         RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
         RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
         RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
         RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
         R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
         R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
      
      As for the performance improvement, the test run time is reduced and
      CPU utilization decreases; the detailed data follows.  I ran the
      multi_mmap test case from the ocfs2-test package on a three-node cluster.
      
      Before applying this patch:
          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
         1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
            5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
           95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
         2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
         2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
         2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun
      
        ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
        Tests with "-b 4096 -C 32768"
        Thu Dec 28 14:44:52 CST 2017
        multi_mmap..................................................Passed.
        Runtime 783 seconds.
      
      After applying this patch:
      
          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
          155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
           95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
         2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
            5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
         2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
          299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
          335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
          535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
         1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync
      
        ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
        Tests with "-b 4096 -C 32768"
        Thu Dec 28 15:04:12 CST 2017
        multi_mmap..................................................Passed.
        Runtime 487 seconds.
      
      Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
      Fixes: 1cce4df0 ("ocfs2: do not lock/unlock() inode DLM lock")
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Eric Ren <zren@suse.com>
      Acked-by: alex chen <alex.chen@huawei.com>
      Acked-by: piaojun <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 24 Jun, 2017 1 commit
    • E
      ocfs2: fix deadlock caused by recursive locking in xattr · 8818efaa
      Eric Ren committed
      Another deadlock path caused by recursive locking is reported.  This
      kind of issue was introduced by commit 743b5f14 ("ocfs2: take
      inode lock in ocfs2_iop_set/get_acl()").  Two deadlock paths have been
      fixed by commit b891fa50 ("ocfs2: fix deadlock issue when taking
      inode lock at vfs entry points").  Yes, we intend to fix this kind of
      case incrementally, because it's hard to find all possible paths at
      once.

      This one can be reproduced as follows.  On node1, cp a large file from
      the home directory to the ocfs2 mountpoint.  Meanwhile, on node2, run
      setfacl/getfacl.  Both nodes will hang.  The backtraces:
      
      On node1:
        __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
        ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
        ocfs2_write_begin+0x43/0x1a0 [ocfs2]
        generic_perform_write+0xa9/0x180
        __generic_file_write_iter+0x1aa/0x1d0
        ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
        __vfs_write+0xc3/0x130
        vfs_write+0xb1/0x1a0
        SyS_write+0x46/0xa0
      
      On node2:
        __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
        ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
        ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
        ocfs2_set_acl+0x22d/0x260 [ocfs2]
        ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
        set_posix_acl+0x75/0xb0
        posix_acl_xattr_set+0x49/0xa0
        __vfs_setxattr+0x69/0x80
        __vfs_setxattr_noperm+0x72/0x1a0
        vfs_setxattr+0xa7/0xb0
        setxattr+0x12d/0x190
        path_setxattr+0x9f/0xb0
        SyS_setxattr+0x14/0x20
      
      Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
      exported by commit 439a36b8 ("ocfs2/dlmglue: prepare tracking logic
      to avoid recursive cluster lock").
      
      Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
      Fixes: 743b5f14 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
      Signed-off-by: Eric Ren <zren@suse.com>
      Reported-by: Thomas Voegtle <tv@lio96.de>
      Tested-by: Thomas Voegtle <tv@lio96.de>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 Mar, 2017 1 commit
  4. 23 Feb, 2017 1 commit
    • E
      ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock · 439a36b8
      Eric Ren committed
      We have to avoid recursive cluster locking, but there is no way to
      check whether a cluster lock has already been taken by a process.
      
      Mostly, we can avoid recursive locking by writing code carefully.
      However, we found that it's very hard to handle the routines that are
      invoked directly by vfs code.  For instance:
      
        const struct inode_operations ocfs2_file_iops = {
            .permission     = ocfs2_permission,
            .get_acl        = ocfs2_iop_get_acl,
            .set_acl        = ocfs2_iop_set_acl,
        };
      
      Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
      
        do_sys_open
         may_open
          inode_permission
           ocfs2_permission
            ocfs2_inode_lock() <=== first time
             generic_permission
              get_acl
               ocfs2_iop_get_acl
                ocfs2_inode_lock() <=== recursive one
      
      A deadlock will occur if a remote EX request comes in between the two
      ocfs2_inode_lock() calls.  Briefly, the deadlock forms as follows:

      On one hand, the OCFS2_LOCK_BLOCKED flag of this lockres is set in
      BAST (ocfs2_generic_handle_bast) when the downconvert is started on
      behalf of the remote EX lock request.  On the other hand, the recursive
      cluster lock (the second one) is blocked in __ocfs2_cluster_lock()
      because of OCFS2_LOCK_BLOCKED.  But the downconvert never completes.
      Why?  Because there is no chance for the first cluster lock on this
      node to be unlocked: we block ourselves in the code path.
      
      The idea to fix this issue is mostly taken from gfs2 code.
      
      1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
         of the pids of the processes that have taken the cluster lock of
         this lock resource;

      2. introduce a new flag for ocfs2_inode_lock_full:
         OCFS2_META_LOCK_GETBH; it means just get back the disk inode bh for
         us if we already hold the cluster lock;

      3. export a helper, ocfs2_is_locked_by_me(), which is used to check
         whether we already hold the cluster lock in the upper code path.
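
The holder-tracking idea in the list above can be illustrated with a small userspace sketch. Everything here is an assumption for illustration: `sim_lockres`, `lock_tracked()` and `unlock_tracked()` are hypothetical names, and a fixed array stands in for whatever list structure the real `l_holders` field uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HOLDERS 8

/* Simplified lock resource; an array stands in for the holder list. */
struct sim_lockres {
    int holders[MAX_HOLDERS];  /* pids currently holding the cluster lock */
    size_t nr_holders;
};

/* Check whether the given pid is already recorded as a holder. */
static bool is_locked_by(const struct sim_lockres *res, int pid)
{
    for (size_t i = 0; i < res->nr_holders; i++)
        if (res->holders[i] == pid)
            return true;
    return false;
}

/*
 * Returns true if this call took the lock (so the caller must unlock),
 * false if the pid already held it and the cluster lock was skipped.
 */
static bool lock_tracked(struct sim_lockres *res, int pid)
{
    if (is_locked_by(res, pid))
        return false;               /* recursive entry: do not lock again */
    assert(res->nr_holders < MAX_HOLDERS);
    res->holders[res->nr_holders++] = pid;
    return true;
}

/* Only the call that actually took the lock removes the holder entry. */
static void unlock_tracked(struct sim_lockres *res, int pid, bool took_lock)
{
    if (!took_lock)
        return;                     /* the outer caller owns the unlock */
    for (size_t i = 0; i < res->nr_holders; i++) {
        if (res->holders[i] == pid) {
            res->holders[i] = res->holders[--res->nr_holders];
            return;
        }
    }
}
```

The inner call never requests the cluster lock a second time, so a pending remote EX request cannot wedge it behind the outer holder.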
      
      The tracking logic should be used by some of the ocfs2 vfs callbacks
      to solve the recursive locking issue caused by the fact that vfs
      routines can call into each other.

      The performance penalty of processing the holder list should only be
      seen in the few cases where the tracking logic is used, such as get/set
      acl.
      
      You may ask: what if we got a PR lock the first time, and want an EX
      lock the second time?  Fortunately, this case never happens in the real
      world, as far as I can see, including permission check and
      (get|set)_(acl|attr); the gfs2 code does the same.
      
      [sfr@canb.auug.org.au remove some inlines]
      Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 11 Jan, 2017 1 commit
    • E
      ocfs2: fix crash caused by stale lvb with fsdlm plugin · e7ee2c08
      Eric Ren committed
      The crash happens rather often when we reset some cluster nodes while
      nodes contend fiercely to do truncate and append.
      
      The crash backtrace is below:
      
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
         ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: Beginning quota recovery on device (253,18) for slot 2
         ocfs2: Finishing quota recovery on device (253,18) for slot 2
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
         ------------[ cut here ]------------
         kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
         Supported: No, Unsupported modules are loaded
         CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
         task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
         RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
         RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
         RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
         RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
         RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
         R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
         R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
         FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
         CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
         Call Trace:
           ocfs2_setattr+0x698/0xa90 [ocfs2]
           notify_change+0x1ae/0x380
           do_truncate+0x5e/0x90
           do_sys_ftruncate.constprop.11+0x108/0x160
           entry_SYSCALL_64_fastpath+0x12/0x6d
         Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
         RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
      
      It's because ocfs2_inode_lock() gets us a stale LVB in which the i_size
      is not equal to the disk i_size.  We mistakenly trust the LVB because
      the underlying fsdlm dlm_lock() doesn't set lkb_sbflags with
      DLM_SBF_VALNOTVALID properly for us.  But why?

      The current code tries to downconvert the lock without the
      DLM_LKF_VALBLK flag, to tell o2cb not to update the RSB's LVB if it's a
      PR->NULL conversion, even if the lock resource type needs LVB.  This is
      not the right way for fsdlm.

      The fsdlm plugin behaves differently with respect to DLM_LKF_VALBLK: it
      depends on DLM_LKF_VALBLK to decide whether we care about the LVB in
      the LKB.  If DLM_LKF_VALBLK is not set, fsdlm will skip both recovering
      the RSB's LVB from this lkb and setting DLM_SBF_VALNOTVALID
      appropriately when a node failure happens.
      
      The following diagram briefly illustrates how this crash happens:
      
      RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
      
      The 1st round:
      
                   Node1                                    Node2
      RSB1: PR
                                                        RSB1(master): NULL->EX
      ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
        ocfs2_dlm_lock(no DLM_LKF_VALBLK)
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      
      dlm_lock(no DLM_LKF_VALBLK)
        convert_lock(overwrite lkb->lkb_exflags
                     with no DLM_LKF_VALBLK)
      
      RSB1: NULL                                        RSB1: EX
                                                        reset Node2
      dlm_recover_rsbs()
        recover_lvb()
      
      /* The LVB is not trustable if the node with EX fails and
       * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
       */
      
       if(!(lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance
                return;                     * to invalidate the LVB here.
                                            */
      
      The 2nd round:
      
               Node 1                                Node2
      RSB1(become master from recovery)
      
      ocfs2_setattr()
        ocfs2_inode_lock(NULL->EX)
          /* dlm_lock() returns the stale lvb without setting DLM_SBF_VALNOTVALID */
          ocfs2_meta_lvb_is_trustable() returns 1 /* so we don't refresh inode from disk */
        ocfs2_truncate_file()
            mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
      
      The fix is quite straightforward.  We keep setting the DLM_LKF_VALBLK
      flag for dlm_lock() if the lock resource type needs LVB and the fsdlm
      plugin is used.
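
The flag decision described in this fix can be written as a tiny pure function. This is a sketch under assumptions, not the kernel code: `SIM_LKF_VALBLK`, `enum sim_stack` and `downconvert_flags()` are hypothetical names, and the flag value is illustrative rather than taken from the DLM headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value only; not taken from the kernel headers. */
#define SIM_LKF_VALBLK 0x1u

/* Which cluster stack is in use (hypothetical enum for this sketch). */
enum sim_stack { SIM_STACK_O2CB, SIM_STACK_FSDLM };

/*
 * Compute the flags for a downconvert request.
 * res_uses_lvb: the lock resource type carries a lock value block.
 * set_lvb:      the caller asked for the value block to be updated.
 */
static unsigned int downconvert_flags(enum sim_stack stack,
                                      bool res_uses_lvb, bool set_lvb)
{
    unsigned int flags = 0;

    /* With fsdlm, always ask for the value block on LVB-type resources. */
    if (stack == SIM_STACK_FSDLM && res_uses_lvb)
        set_lvb = true;
    if (set_lvb)
        flags |= SIM_LKF_VALBLK;
    return flags;
}
```

With this rule a PR->NULL downconvert under fsdlm keeps the value-block flag, while the o2cb behaviour is left unchanged.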
      
      Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 27 Jul, 2016 2 commits
  7. 31 Mar, 2016 1 commit
    • A
      posix_acl: Inode acl caching fixes · b8a7a3a6
      Andreas Gruenbacher committed
      When get_acl() is called for an inode whose ACL is not cached yet, the
      get_acl inode operation is called to fetch the ACL from the filesystem.
      The inode operation is responsible for updating the cached acl with
      set_cached_acl().  This is done without locking at the VFS level, so
      another task can call set_cached_acl() or forget_cached_acl() before the
      get_acl inode operation gets to calling set_cached_acl(), and then
      get_acl's call to set_cached_acl() results in caching an outdated ACL.
      
      Prevent this from happening by setting the cached ACL pointer to a
      task-specific sentinel value before calling the get_acl inode operation.
      Move the responsibility for updating the cached ACL from the get_acl
      inode operations to get_acl().  There, only set the cached ACL if the
      sentinel value hasn't changed.
      
      The sentinel values are chosen to have odd values.  Likewise, the value
      of ACL_NOT_CACHED is odd.  In contrast, ACL object pointers always have
      an even value (ACLs are aligned in memory).  This makes it possible to
      distinguish uncached ACL values from ACL objects.
      
      In addition, switch from guarding inode->i_acl and inode->i_default_acl
      updates by the inode->i_lock spinlock to using xchg() and cmpxchg().
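
The odd-sentinel and compare-and-exchange scheme described above can be modelled in userspace with C11 atomics. This is a minimal sketch, not the VFS code: `sim_acl`, `is_uncached()`, `sentinel_for()` and `publish_if_unchanged()` are hypothetical names, and `atomic_compare_exchange_strong()` stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an ACL object; it is aligned, so its address is even. */
struct sim_acl { int mode; };

/* Odd values can never be real object pointers. */
static bool is_uncached(const void *p)
{
    return ((uintptr_t)p & 1) != 0;
}

/* Per-task sentinel: the (even) task address plus one, hence odd. */
static void *sentinel_for(const void *task)
{
    return (void *)((uintptr_t)task + 1);
}

/*
 * Publish `fetched` only if the slot still holds our sentinel, i.e. nobody
 * set or invalidated the cache while we were fetching from the filesystem.
 */
static bool publish_if_unchanged(_Atomic(void *) *slot, void *sentinel,
                                 struct sim_acl *fetched)
{
    void *expected = sentinel;
    return atomic_compare_exchange_strong(slot, &expected, (void *)fetched);
}
```

If another task overwrites the slot between installing the sentinel and publishing, the exchange fails and the possibly outdated ACL is simply not cached.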
      
      Filesystems that do not want ACLs returned from their get_acl inode
      operations to be cached must call forget_cached_acl() to prevent the VFS
      from doing so.
      
      (Patch written by Al Viro and Andreas Gruenbacher.)

      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. 22 Jan, 2016 1 commit
    • T
      ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock · b1b1e15e
      Tariq Saeed committed
      NFS on a 2-node ocfs2 cluster, with each node exporting a directory.
      The lock causing the hang is the global bitmap inode lock.  Node 1 is
      master and has the lock granted in PR mode; Node 2 is in the converting
      list (PR -> EX).  There are no holders of the lock on the master node,
      so it should downconvert to NL and grant EX to node 2, but that does
      not happen.  BLOCKED + QUEUED are set in the lock res and it is on the
      osb blocked list.  Threads are waiting in __ocfs2_cluster_lock on
      BLOCKED.  One thread wants EX, the rest want PR.  So it is as though
      the downconvert thread needs to be kicked to complete the conversion.
      
      The hang is caused by an EX req coming into __ocfs2_cluster_lock on the
      heels of a PR req after it sets BUSY (drops l_lock, releasing EX
      thread), forcing the incoming EX to wait on BUSY without doing anything.
      PR has called ocfs2_dlm_lock, which sets the node 1 lock from NL -> PR,
      queues ast.
      
      At this time, the upconvert (PR -> EX) arrives from node 2 and finds a
      conflict with the node 1 lock in PR, so the lock res is put on the dlm
      thread's dirty list.

      After returning from ocfs2_dlm_lock, the PR thread now waits behind EX
      on BUSY until awoken by the ast.
      
      Now it is dlm_thread that serially runs dlm_shuffle_lists, ast, bast, in
      that order.  dlm_shuffle_lists queues a bast on behalf of node 2 (which
      will be run by dlm_thread right after the ast).  ast does its part, sets
      UPCONVERT_FINISHING, clears BUSY and wakes its waiters.  Next,
      dlm_thread runs bast.  It sets BLOCKED and kicks dc thread.  dc thread
      runs ocfs2_unblock_lock, but since UPCONVERT_FINISHING is set, skips
      doing anything and requeues.
      
      Inside of __ocfs2_cluster_lock, since EX has been waiting on BUSY ahead
      of PR, it wakes up first, finds BLOCKED set and skips doing anything but
      clearing UPCONVERT_FINISHING (which was actually "meant" for the PR
      thread), and this time waits on BLOCKED.  Next, the PR thread comes out
      of wait but since UPCONVERT_FINISHING is not set, it skips updating the
      l_ro_holders and goes straight to wait on BLOCKED.  So there, we have a
      hang! Threads in __ocfs2_cluster_lock wait on BLOCKED, lock res in osb
      blocked list.  Only when dc thread is awoken, it will run
      ocfs2_unblock_lock and things will unhang.
      
      One way to fix this is to wake the dc thread on the flag after clearing
      UPCONVERT_FINISHING.
      
      Orabug: 20933419
      Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
      Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Eric Ren <zren@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 15 Jan, 2016 1 commit
  10. 06 Nov, 2015 1 commit
  11. 05 Sep, 2015 1 commit
  12. 07 Aug, 2015 1 commit
    • J
      ocfs2: fix BUG in ocfs2_downconvert_thread_do_work() · 209f7512
      Joseph Qi committed
      The "BUG_ON(list_empty(&osb->blocked_lock_list))" in
      ocfs2_downconvert_thread_do_work can be triggered in the following case:
      
      ocfs2dc first saves osb->blocked_lock_count to the local variable
      processed, and then processes the dentry lockres.  During the dentry
      put, it calls iput and then deletes the rw, inode and open lockres from
      the blocked list in ocfs2_mark_lockres_freeing.  As a result, the
      variable `processed' no longer reflects the number of blocked lockres
      to be processed, which triggers the BUG.
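
The stale-snapshot problem can be reproduced in a toy model. The commit message above does not spell out the fix, so the loop guard shown here (re-checking that the list is non-empty on every round instead of asserting it) is an assumption about one plausible remedy; `sim_osb`, `process_one()` and `do_work()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Toy blocked list: only the length matters for this illustration. */
struct sim_osb {
    size_t blocked_count;
};

/* Processing one entry may drop extra entries as a side effect. */
static void process_one(struct sim_osb *osb, size_t side_effect_drops)
{
    osb->blocked_count--;                       /* the entry being handled */
    if (side_effect_drops > osb->blocked_count)
        side_effect_drops = osb->blocked_count;
    osb->blocked_count -= side_effect_drops;    /* e.g. removed during iput */
}

/* Returns the number of entries actually processed. */
static size_t do_work(struct sim_osb *osb, size_t side_effect_drops)
{
    size_t processed = osb->blocked_count;      /* snapshot, may go stale */
    size_t done = 0;

    /* Re-check emptiness each round instead of trusting the snapshot. */
    while (processed && osb->blocked_count > 0) {
        process_one(osb, side_effect_drops);
        processed--;
        done++;
    }
    return done;
}
```

With four queued entries and three of them removed as a side effect of the first, the loop stops after one round instead of tripping over an empty list.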

      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 22 Apr, 2015 1 commit
    • L
      Revert "ocfs2: incorrect check for debugfs returns" · 8f443e23
      Linus Torvalds committed
      This reverts commit e2ac55b6.
      
      Huang Ying reports that this causes a hang at boot with debugfs disabled.
      
      It is true that the debugfs error checks are kind of confusing, and this
      code certainly merits more cleanup and thinking about it, but there's
      something wrong with the trivial "check not just for NULL, but for error
      pointers too" patch.
      
      Yes, with debugfs disabled, we will end up setting the o2hb_debug_dir
      pointer variable to an error pointer (-ENODEV), and then continue as if
      everything was fine.  But since debugfs is disabled, all the _users_ of
      that pointer end up being compiled away, so even though the pointer can
      not be dereferenced, that's still fine.
      
      So it's confusing and somewhat questionable, but the "more correct"
      error checks end up causing more trouble than they fix.

      Reported-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Chengyu Song <csong84@gatech.edu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 15 Apr, 2015 2 commits
    • A
      ocfs2: check if the ocfs2 lock resource has been initialized before calling ocfs2_dlm_lock · 2f2eca20
      alex chen committed
      If the ocfs2 lockres has not been initialized before calling
      ocfs2_dlm_lock, the lock won't be dropped, which causes umount to hang.
      The case is described below:
      
      ocfs2_mknod
          ocfs2_mknod_locked
              __ocfs2_mknod_locked
                  ocfs2_journal_access_di
                  Failed because of -ENOMEM or other reasons, the inode lockres
                  has not been initialized yet.
      
          iput(inode)
              ocfs2_evict_inode
                  ocfs2_delete_inode
                      ocfs2_inode_lock
                          ocfs2_inode_lock_full_nested
                              __ocfs2_cluster_lock
                              Succeeds and allocates a new dlm lockres.
                  ocfs2_clear_inode
                      ocfs2_open_unlock
                          ocfs2_drop_inode_locks
                              ocfs2_drop_lock
                              Since lockres has not been initialized, the lock
                              can't be dropped and the lockres can't be
                              migrated, thus umount will hang forever.

      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: joyce.xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      ocfs2: incorrect check for debugfs returns · e2ac55b6
      Chengyu Song committed
      debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs
      is not configured, so the return value should be checked against
      ERROR_VALUE as well, otherwise the later dereference of the dentry pointer
      would crash the kernel.
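
The difference between a NULL-only check and a check that also catches error pointers can be shown with a userspace imitation of the kernel's error-pointer convention. The helpers below (`sim_err_ptr()`, `sim_is_err()`, `sim_create_dir()`, `dir_is_usable()`) are hypothetical and only mimic the idea; they are not the kernel's own macros or the debugfs API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, as error pointers do. */
static void *sim_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Error pointers occupy the top SIM_MAX_ERRNO addresses. */
static bool sim_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SIM_MAX_ERRNO;
}

/* Hypothetical creator: fails with -ENODEV when the feature is compiled out. */
static void *sim_create_dir(bool feature_enabled, void *real_dir)
{
    return feature_enabled ? real_dir : sim_err_ptr(-ENODEV);
}

/* A NULL-only check misses error pointers; this one catches both. */
static bool dir_is_usable(const void *dir)
{
    return dir != NULL && !sim_is_err(dir);
}
```

An error pointer is non-NULL, so a bare `if (!dir)` test treats the failure as success; whether treating it as a hard failure is the right reaction is a separate question for the caller.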
      
      This patch tries to solve this problem by fixing certain checks. However,
      I have found that other call sites are protected by #ifdef CONFIG_DEBUG_FS.
      In the current implementation, if CONFIG_DEBUG_FS is defined, then the above
      two functions will never return any ERROR_VALUE. So another possibility
      to fix this is to surround all the buggy checks/functions with the same
      #ifdef CONFIG_DEBUG_FS. But I'm not sure if this would break any functionality,
      as only OCFS2_FS_STATS declares a dependency on DEBUG_FS.

      Signed-off-by: Chengyu Song <csong84@gatech.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 11 Feb, 2015 1 commit
    • A
      ocfs2: prune the dcache before deleting the dentry of directory · 10ab8811
      alex chen committed
      In ocfs2_dentry_convert_worker, we should prune the dcache before deleting
      the dentry of a directory; otherwise, in the following case the inode of
      the directory will remain in the orphan directory until the device is
      unmounted.
      
      Mount point: /mnt/ocfs2
      Node A                              Node B
      mkdir /mnt/ocfs2/testdir
        ocfs2_mkdir
        ->ocfs2_mknod
        ->ocfs2_dentry_attach_lock
        ->ocfs2_dentry_lock(dentry, 0)
        ... ...
      touch /mnt/ocfs2/testdir/testfile
                                          unlink /mnt/ocfs2/testdir/testfile
                                          rmdir /mnt/ocfs2/testdir
                                            ocfs2_unlink
                                            ->ocfs2_remote_dentry_delete
                                            ->ocfs2_dentry_lock(dentry, 1)
                                            ... ...
      ... ...
      ocfs2_downconvert_thread
      ->ocfs2_unblock_lock
      ->ocfs2_dentry_convert_worker
      ->ocfs2_find_local_alias
        ->dget_dlock
      ->d_delete
      Here the dentry cannot be
      released because the children's
      dentries are negative but still exist.
      Finally, this inode will remain
      in the orphan directory until its
      children are destroyed.
      
      So before deleting dentry of directory, we should prune the dcache to
      remove unused children of the parent dentry by shrink_dcache_parent().
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: joyce.xue <xuejiufei@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10ab8811
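      The ordering argument in the commit above can be illustrated with a small
      userspace model. This is only a sketch under stated assumptions: the
      toy_* names and the struct layout are hypothetical and are not the kernel
      dcache API. The model assumes that each cached child pins its parent with
      one reference, and that a delete can release the inode only when the
      caller holds the sole remaining reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 8

/* Hypothetical stand-in for a dentry; not the kernel's struct dentry. */
struct toy_dentry {
    int refcount;                                  /* references held on this dentry */
    bool has_inode;                                /* false models a negative dentry */
    struct toy_dentry *children[TOY_MAX_CHILDREN]; /* cached child dentries */
    size_t nr_children;
};

/* Caching a child pins the parent with one extra reference. */
static void toy_add_child(struct toy_dentry *parent, struct toy_dentry *child)
{
    assert(parent->nr_children < TOY_MAX_CHILDREN);
    parent->children[parent->nr_children++] = child;
    parent->refcount++;
}

/* Drop every cached child that nobody is using (refcount == 0). */
static void toy_prune_children(struct toy_dentry *parent)
{
    size_t kept = 0;

    for (size_t i = 0; i < parent->nr_children; i++) {
        struct toy_dentry *child = parent->children[i];

        if (child->refcount == 0)
            parent->refcount--;              /* child gone, parent unpinned */
        else
            parent->children[kept++] = child;
    }
    parent->nr_children = kept;
}

/* Release the inode only if the caller holds the sole reference. */
static bool toy_delete(struct toy_dentry *d)
{
    if (d->refcount != 1)
        return false;
    d->has_inode = false;
    return true;
}
```

      In this model, calling toy_delete() on a directory that still caches an
      unused negative child fails, while calling toy_prune_children() first
      lets the same delete succeed. That mirrors the order the commit asks for:
      prune, then delete.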
  16. 11 Dec, 2014 1 commit
  17. 20 Nov, 2014 1 commit
  18. 10 Oct, 2014 1 commit
  19. 05 Jun, 2014 1 commit
  20. 04 Apr, 2014 1 commit
  21. 22 Jan, 2014 2 commits
  22. 15 Nov, 2013 1 commit
  23. 08 May, 2013 1 commit
    • Z
      aio: remove retry-based AIO · 41003a7b
      Zach Brown committed
      This removes the retry-based AIO infrastructure now that nothing in tree
      is using it.
      
      We want to remove retry-based AIO because it is fundamentally unsafe.
      It retries IO submission from a kernel thread that has only assumed the
      mm of the submitting task.  All other task_struct references in the IO
      submission path will see the kernel thread, not the submitting task.
      This design flaw means that nothing of any meaningful complexity can use
      retry-based AIO.
      
      This removes all the code and data associated with the retry machinery.
      The most significant benefit of this is the removal of the locking
      around the unused run list in the submission path.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Signed-off-by: Zach Brown <zab@redhat.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41003a7b
  24. 22 Feb, 2013 1 commit
  25. 13 Feb, 2013 1 commit
  26. 04 Jul, 2012 2 commits
  27. 02 Nov, 2011 1 commit
  28. 01 Jun, 2011 1 commit
  29. 07 Mar, 2011 1 commit
    • T
      ocfs2: Remove EXIT from masklog. · c1e8d35e
      Tao Ma committed
      mlog_exit is used to record the exit status of a function.
      But because it is added to so many functions, enabling it fills
      the system logs quickly and causes too much I/O, so in practice
      no one can enable it on a production system, or even for a test.
      
      This patch removes or replaces it:
      1. If all the error paths already use mlog_errno, it is simply removed.
         Otherwise, it is replaced by mlog_errno.
      2. If it is used to print a return value, it is replaced with
         mlog(0,...).
      mlog_exit_ptr is changed to mlog(0,...).
      All those mlog(0,...) calls will be replaced with trace events later.
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      c1e8d35e
  30. 21 Feb, 2011 1 commit
    • T
      ocfs2: Remove ENTRY from masklog. · ef6b689b
      Tao Ma committed
      ENTRY is used to record the entry into a function.
      But because it is added to so many functions, enabling it fills
      the system logs quickly and causes too much I/O, so in practice
      no one can enable it on a production system, or even for a test.
      
      So mlog_entry_void is simply removed, and mlog_entry(...) is replaced
      with mlog(0,...), which will be replaced by trace events later.
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      ef6b689b
  31. 20 Feb, 2011 1 commit
    • S
      ocfs2: Use hrtimer to track ocfs2 fs lock stats · 5bc970e8
      Sunil Mushran committed
      Patch makes use of the hrtimer to track times in ocfs2 lock stats.
      
      The patch is a bit involved to ensure no additional impact on the memory
      footprint. The size of ocfs2_inode_cache remains 1280 bytes on 32-bit systems.
      
      A related change was to modify the unit of the max wait time from
      nanoseconds to microseconds, allowing us to track max times larger than
      4 seconds. This change necessitated bumping the output version of the
      debugfs file, locking_state, from 2 to 3.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <jlbec@evilplan.org>
      5bc970e8
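      The "4 secs" limit mentioned in the commit above is consistent with a
      32-bit counter holding nanoseconds. The sketch below works through that
      arithmetic. It assumes the max-wait field is an unsigned 32-bit value;
      the helper names are hypothetical and are not taken from the ocfs2
      sources.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define USEC_PER_SEC  1000000ULL
#define NSEC_PER_SEC  1000000000ULL

/*
 * Largest whole number of seconds an unsigned 32-bit field can hold
 * when it counts in units of 1/units_per_sec seconds.
 */
static uint64_t max_seconds(uint64_t units_per_sec)
{
    assert(units_per_sec > 0);
    return (uint64_t)UINT32_MAX / units_per_sec;
}

/*
 * Convert a wait time in nanoseconds to the microsecond value that
 * would be stored, saturating at the 32-bit maximum (an assumption of
 * this sketch, not documented ocfs2 behavior).
 */
static uint32_t wait_ns_to_stored_usecs(uint64_t wait_ns)
{
    uint64_t usecs = wait_ns / NSEC_PER_USEC;

    return usecs > UINT32_MAX ? UINT32_MAX : (uint32_t)usecs;
}
```

      Under these assumptions, max_seconds(NSEC_PER_SEC) is 4 and
      max_seconds(USEC_PER_SEC) is 4294 (roughly 71 minutes), so switching the
      unit extends the trackable maximum by a factor of 1000 without widening
      the field.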
  32. 11 Sep, 2010 1 commit
    • G
      Track negative entries v3 · 5e98d492
      Goldwyn Rodrigues committed
      Track negative dentries by recording the generation number of the parent
      directory in d_fsdata. The generation number for the parent directory is
      recorded in the inode_info and is incremented every time the lock on the
      directory is dropped.
      
      If the generation numbers of the parent directory and the negative dentry
      match, there is no need to perform the revalidate; otherwise a revalidate
      is forced. This improves performance in situations where nodes look for
      the same non-existent file multiple times.
      
      Thanks Mark for explaining the DLM sequence.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      5e98d492
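      The generation check described in the commit above can be sketched as a
      small userspace model. The toy_* types and functions are hypothetical
      and only stand in for the kernel structures: parent_gen plays the role
      of the value kept in d_fsdata, and lock_gen the counter kept in the
      directory's inode_info.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-directory inode info. */
struct toy_dir {
    uint64_t lock_gen;   /* bumped every time the directory lock is dropped */
};

/* Hypothetical stand-in for a negative dentry. */
struct toy_neg_dentry {
    uint64_t parent_gen; /* parent's generation when the lookup failed */
};

/* Dropping the cluster lock means other nodes may now change the directory. */
static void toy_dir_drop_lock(struct toy_dir *dir)
{
    assert(dir);
    dir->lock_gen++;
}

/* Remember the parent's generation when a lookup finds nothing. */
static void toy_record_negative(struct toy_neg_dentry *d,
                                const struct toy_dir *parent)
{
    assert(d && parent);
    d->parent_gen = parent->lock_gen;
}

/* True if the cached "no such file" answer can be trusted as-is. */
static bool toy_negative_still_valid(const struct toy_neg_dentry *d,
                                     const struct toy_dir *parent)
{
    assert(d && parent);
    return d->parent_gen == parent->lock_gen;
}
```

      As long as the directory lock has not been dropped since the failed
      lookup, toy_negative_still_valid() returns true and the revalidate can
      be skipped. After toy_dir_drop_lock(), the generations differ and a
      revalidate would be forced.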
  33. 20 Jul, 2010 1 commit
  34. 22 May, 2010 1 commit
  35. 28 Feb, 2010 1 commit