1. 01 Feb, 2018 (36 commits)
    • ocfs2: return error when we attempt to access a dirty bh in jbd2 · d984187e
      Committed by piaojun
      We should not reuse the dirty bh in jbd2 directly due to the following
      situation:
      
      1. When removing extent rec, we will dirty the bhs of extent rec and
         truncate log at the same time, and hand them over to jbd2.
      
      2. The bhs are submitted to jbd2 area successfully.
      
      3. The device's write-back thread helps flush the bhs to disk but
         encounters a write error due to an abnormal storage link.
      
      4. After a while the storage link becomes normal. The truncate log
         flush worker, triggered by the next space reclaim, finds the dirty
         bh of the truncate log, clears its 'BH_Write_EIO' and then sets it
         uptodate in __ocfs2_journal_access():
      
         ocfs2_truncate_log_worker
           ocfs2_flush_truncate_log
             __ocfs2_flush_truncate_log
               ocfs2_replay_truncate_records
                 ocfs2_journal_access_di
                   __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodate.
      
      5. Then jbd2 will flush the bh of the truncate log to disk, but the bh
         of the extent rec is still in an error state, and unfortunately
         nobody will take care of it.
      
      6. In the end the space of the extent rec is not reduced, but the
         truncate log flush worker has given it back to globalalloc. That
         causes a duplicate cluster problem, which fsck.ocfs2 can identify.
      
      Sadly we can hardly revert this, so set the fs read-only to avoid
      ruining the atomicity and consistency of space reclaim.
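
The decision described above can be modelled in a few lines. This is a user-space Python sketch, not the kernel patch: `BufferHead`, `Filesystem`, `JournalAccessError` and `journal_access` are hypothetical stand-ins for the buffer head state bits and `__ocfs2_journal_access()`, and the exact error returned by the kernel is not modelled.

```python
from dataclasses import dataclass


@dataclass
class BufferHead:
    """Hypothetical stand-in for a kernel buffer head's state bits."""
    uptodate: bool = True
    write_io_error: bool = False


@dataclass
class Filesystem:
    read_only: bool = False


class JournalAccessError(Exception):
    """Raised when a buffer cannot safely join a new transaction."""


def journal_access(fs: Filesystem, bh: BufferHead) -> None:
    # A buffer whose earlier write-back failed must not be silently
    # reused: other buffers from the same failed transaction may still
    # be unwritten, so clearing the error here would hide inconsistency.
    if bh.write_io_error and not bh.uptodate:
        fs.read_only = True
        raise JournalAccessError("previous write of this buffer failed")
    # Otherwise the buffer may be added to the running transaction.


fs = Filesystem()
journal_access(fs, BufferHead())  # clean buffer: accepted
try:
    journal_access(fs, BufferHead(uptodate=False, write_io_error=True))
except JournalAccessError:
    pass  # the filesystem is now marked read-only
```

The point of the model is that the error is surfaced and the filesystem stops accepting changes, instead of the error flag being cleared and the buffer reused.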
      
      Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
      Fixes: acf8fdbe ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: unlock bh_state if bg check fails · e75ed71b
      Committed by Changwei Ge
      We should unlock bh_state if bg->bg_free_bits_count > bg->bg_bits.
      
      Link: http://lkml.kernel.org/r/1516843095-23680-1-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: nowait aio support · c4c2416a
      Committed by Gang He
      Return EAGAIN if any of the following checks fail for direct I/O:
      
       - Cannot get the related locks immediately
      
       - Blocks are not allocated at the write location, so the write would
         trigger block allocation and blocking I/O operations.
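
The two checks above can be illustrated with a small user-space model. This is a sketch only: `nowait_write`, its parameters and the use of `threading.Lock` as a stand-in for the cluster locks are assumptions for illustration, not the ocfs2 code path.

```python
import errno
import threading


def nowait_write(lock: threading.Lock, blocks_allocated: bool, nowait: bool) -> int:
    """Return 0 on success or a negative errno, kernel style."""
    if nowait:
        # Try the lock without sleeping; give up instead of waiting.
        if not lock.acquire(blocking=False):
            return -errno.EAGAIN
    else:
        lock.acquire()
    try:
        # An unallocated target range would need block allocation,
        # which may block, so a nowait request bails out here too.
        if nowait and not blocks_allocated:
            return -errno.EAGAIN
        return 0  # the actual write would happen here
    finally:
        lock.release()
```

A caller that receives `-EAGAIN` is expected to retry later or fall back to a blocking request.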
      
      [ghe@suse.com: v4]
        Link: http://lkml.kernel.org/r/1516007283-29932-4-git-send-email-ghe@suse.com
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1511944612-9629-4-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1511775987-841-4-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add ocfs2_overwrite_io() · ac604d3c
      Committed by Gang He
      Add the ocfs2_overwrite_io() function, which checks whether a write
      only overwrites already-allocated blocks; otherwise, the write incurs
      extra block allocation overhead.
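
As a rough illustration of what such an overwrite check decides, the sketch below tests whether a write range is fully covered by allocated extents. It is a simplified model: `is_overwrite` and the `(start, length)` extent list are hypothetical, and extent flags that the real function may consult are ignored.

```python
from typing import List, Tuple


def is_overwrite(allocated: List[Tuple[int, int]], offset: int, length: int) -> bool:
    """True if [offset, offset + length) lies entirely inside allocated extents.

    `allocated` is a list of (start, length) extents in the same units
    as `offset` and `length`.
    """
    pos, end = offset, offset + length
    for start, ext_len in sorted(allocated):
        if start > pos:
            break  # a hole begins at `pos`, so allocation would be needed
        pos = max(pos, start + ext_len)
        if pos >= end:
            return True
    return pos >= end
```

A nowait writer would use a result of `False` as the signal to bail out rather than allocate.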
      
      [ghe@suse.com: v3]
        Link: http://lkml.kernel.org/r/1514455665-16325-3-git-send-email-ghe@suse.com
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1511944612-9629-3-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1511775987-841-3-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: alex chen <alex.chen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add ocfs2_try_rw_lock() and ocfs2_try_inode_lock() · 06e7f13d
      Committed by Gang He
      Patch series "ocfs2: add nowait aio support", v4.
      
      VFS layer has introduced the non-blocking aio flag IOCB_NOWAIT, which
      tells the kernel to bail out if an AIO request will block for reasons
      such as file allocations, or writeback triggering, or would block while
      allocating requests while performing direct I/O.
      
      Subsequently, pwritev2/preadv2 also can leverage this part of kernel
      code.  So far, ext4/xfs/btrfs have supported this feature.  Add the
      related code for the ocfs2 file system.
      
      This patch (of 3):
      
      Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which will be
      used in non-blocking IO scenarios.
      
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1511944612-9629-2-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1511775987-841-2-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Acked-by: alex chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add trimfs lock to avoid duplicated trims in cluster · 637dd20c
      Committed by Gang He
      ocfs2 supports trimming the underlying disk via the fstrim command.  But
      there is a problem: ocfs2 is a shared-disk cluster file system, so if
      the user configures a scheduled fstrim job on each file system node,
      multiple nodes will trim the shared disk simultaneously, which wastes
      CPU and I/O.  This also might negatively affect the lifetime of
      poor-quality SSD devices.
      
      So we introduce a trimfs dlm lock that lets the nodes communicate in
      this case, so that only one fstrim command does the trimming on a
      shared disk in the cluster.  The fstrim commands from the other nodes
      wait for the first fstrim to finish and then return success directly,
      to avoid running the same trim on the shared disk again.
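
The intended behaviour can be sketched with threads standing in for cluster nodes. Everything here is a hypothetical model: `TrimCoordinator` plays the role of the trimfs lock, and identifying "the same trim" by a `job_id` string is an assumption made for the example, not how the dlm lock carries that information.

```python
import threading


class TrimCoordinator:
    """Hypothetical stand-in for the cluster-wide trimfs lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_job = None
        self.trim_count = 0  # how many times the disk was really trimmed

    def fstrim(self, job_id: str) -> str:
        # Nodes that lose the race block here until the trimmer finishes.
        with self._lock:
            if self._last_job == job_id:
                return "skipped"  # already trimmed: report success directly
            self.trim_count += 1  # the real trim would run here
            self._last_job = job_id
            return "trimmed"


def run_cluster(coordinator: TrimCoordinator, job_id: str, nodes: int = 3) -> list:
    """Run the same scheduled trim from several 'nodes' at once."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(coordinator.fstrim(job_id)))
        for _ in range(nodes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

With three nodes running the same job, exactly one performs the trim and the other two return success without touching the disk.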
      
      Link: http://lkml.kernel.org/r/1513228484-2084-2-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add trimfs dlm lock resource · 4882abeb
      Committed by Gang He
      Introduce a new dlm lock resource, which will be used to communicate
      during fstrimming of an ocfs2 device from cluster nodes.
      
      Link: http://lkml.kernel.org/r/1513228484-2084-1-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: try to reuse extent block in dealloc without meta_alloc · 71a36944
      Committed by Changwei Ge
      A crash issue was reported by John Lightsey with a call trace as follows:
      
        ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2]
        ocfs2_change_extent_flag+0x33a/0x470 [ocfs2]
        ocfs2_mark_extent_written+0x172/0x220 [ocfs2]
        ocfs2_dio_end_io+0x62d/0x910 [ocfs2]
        dio_complete+0x19a/0x1a0
        do_blockdev_direct_IO+0x19dd/0x1eb0
        __blockdev_direct_IO+0x43/0x50
        ocfs2_direct_IO+0x8f/0xa0 [ocfs2]
        generic_file_direct_write+0xb2/0x170
        __generic_file_write_iter+0xc3/0x1b0
        ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2]
        __vfs_write+0xae/0xf0
        vfs_write+0xb8/0x1b0
        SyS_write+0x4f/0xb0
        system_call_fastpath+0x16/0x75
      
      The BUG code indicated that the extent tree wanted to grow but no
      metadata had been reserved ahead of time.  From my investigation into
      this issue, the root cause is that although not enough metadata is
      reserved, there should be enough for the following use.  The rightmost
      extent is merged into its left neighbour after a number of extents are
      marked written, because marking extents written produces many
      physically contiguous extents.  In the end, an empty extent shows up
      and the rightmost path is removed from the extent tree.
      
      Add a new mechanism to reuse extent blocks cached in dealloc, which
      were just unlinked from the extent tree, to solve this crash issue.
      
      The criterion is that during marking extents *written*, if extent
      rotation and merging result in unlinking an extent and the extent tree
      later needs to grow without any metadata reserved ahead of time, try
      to reuse the extents in dealloc, in which deleted extents are cached.
      
      Also, this patch addresses the issue John reported that ::dw_zero_count
      is not calculated properly.
      
      After applying this patch, the issue John reported was gone.  Thanks
      for the reproducer provided by John.  This patch has also passed the
      ocfs2-test suite (29 cases) run by New H3C Group.
      
      [ge.changwei@h3c.com: fix static checker warning]
        Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F29196AE@H3CMLB12-EX.srv.huawei-3com.com
      [akpm@linux-foundation.org: brelse(NULL) is legal]
      Link: http://lkml.kernel.org/r/1515479070-32653-2-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reported-by: John Lightsey <john@nixnuts.net>
      Tested-by: John Lightsey <john@nixnuts.net>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: make metadata estimation accurate and clear · 63de8bd9
      Committed by Changwei Ge
      The current code assumes that ::w_unwritten_list always has exactly
      one item on it.  This is not right and is hard to understand.  So
      improve how unwritten items are counted.
      
      Link: http://lkml.kernel.org/r/1515479070-32653-1-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reported-by: John Lightsey <john@nixnuts.net>
      Tested-by: John Lightsey <john@nixnuts.net>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute · 16c8d569
      Committed by piaojun
      The race between *set_acl and *get_acl will cause getting incomplete
      xattr data as below:
      
        processA                                    processB
      
        ocfs2_set_acl
          ocfs2_xattr_set
            __ocfs2_xattr_set_handle
      
                                                    ocfs2_get_acl_nolock
                                                      ocfs2_xattr_get_nolock:
      
      processB may get incomplete xattr data if processA has not finished set_acl.
      
      So we should use 'ip_xattr_sem' to protect getting extended attribute in
      ocfs2_get_acl_nolock(), as other processes could be changing it
      concurrently.
      
      Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clean up dead code in alloc.c · d22aa615
      Committed by Changwei Ge
      Some stack variables are no longer used but still assigned.  Trim them.
      
      Link: http://lkml.kernel.org/r/1516105069-12643-1-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/xattr: assign errno to 'ret' in ocfs2_calc_xattr_init() · c0a1a6d7
      Committed by piaojun
      We need to catch the errno returned by ocfs2_xattr_get_nolock() and
      assign it to 'ret', so that it is printed and reported to upper callers.
      
      Link: http://lkml.kernel.org/r/5A571CAF.8050709@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Acked-by: Gang He <ghe@suse.com>
      Acked-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE · ff26cc10
      Committed by Gang He
      If we can't get the inode lock immediately in
      ocfs2_inode_lock_with_page() when reading a page, we should not return
      directly, since this leads to a softlockup problem when the kernel is
      built without CONFIG_PREEMPT.  Instead, take a blocking lock and
      immediately unlock it before returning.  This avoids wasting CPU on
      lots of retries, improves fairness in getting the lock among multiple
      nodes, and increases efficiency when the same file is modified
      frequently from multiple nodes.
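
The retry pattern described here can be modelled in user space. This is only a sketch under stated assumptions: `threading.Lock` objects stand in for the inode cluster lock and the page lock, `RETRY` is a placeholder for AOP_TRUNCATED_PAGE, and `inode_lock_with_page` is a hypothetical model of the function named above, not its source.

```python
import threading

RETRY = "AOP_TRUNCATED_PAGE"  # placeholder for the kernel's retry code


def inode_lock_with_page(inode_lock: threading.Lock, page_lock: threading.Lock):
    """Called with page_lock held.

    Returns 0 with inode_lock held, or RETRY with both locks released.
    """
    if inode_lock.acquire(blocking=False):
        return 0
    # Contended: drop the page lock so the lock holder can make progress.
    page_lock.release()
    # Sleep until the inode lock becomes available, then give it straight
    # back. The caller retries only once the lock is likely to be free,
    # instead of spinning through immediate retries.
    inode_lock.acquire()
    inode_lock.release()
    return RETRY
```

The blocking acquire is what turns a busy retry loop into a single sleep, which matches the reduced CPU usage reported in the measurements below.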
      
      The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
      looks like:
      
        Kernel panic - not syncing: softlockup: hung tasks
        CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          <IRQ>
          dump_stack+0x5c/0x82
          panic+0xd5/0x21e
          watchdog_timer_fn+0x208/0x210
          __hrtimer_run_queues+0xcc/0x200
          hrtimer_interrupt+0xa6/0x1f0
          smp_apic_timer_interrupt+0x34/0x50
          apic_timer_interrupt+0x96/0xa0
          </IRQ>
         RIP: 0010:unlock_page+0x17/0x30
         RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
         RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
         RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
         RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
         R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
         R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
          ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
          ocfs2_readpage+0x41/0x2d0 [ocfs2]
          filemap_fault+0x12b/0x5c0
          ocfs2_fault+0x29/0xb0 [ocfs2]
          __do_fault+0x1a/0xa0
          __handle_mm_fault+0xbe8/0x1090
          handle_mm_fault+0xaa/0x1f0
          __do_page_fault+0x235/0x4b0
          trace_do_page_fault+0x3c/0x110
          async_page_fault+0x28/0x30
         RIP: 0033:0x7fa75ded638e
         RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
         RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
         RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
         RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
         R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
         R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
      
      As for the performance improvement, the testing time is reduced and
      CPU utilization decreases; the detailed data is as follows.  I ran the
      multi_mmap test case from the ocfs2-test package in a three-node cluster.
      
      Before applying this patch:
          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
         1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
            5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
           95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
         2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
         2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
         2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun
      
        ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
        Tests with "-b 4096 -C 32768"
        Thu Dec 28 14:44:52 CST 2017
        multi_mmap..................................................Passed.
        Runtime 783 seconds.
      
      After applying this patch:
      
          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
          155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
           95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
         2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
            5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
         2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
          299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
          335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
          535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
         1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync
      
        ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
        Tests with "-b 4096 -C 32768"
        Thu Dec 28 15:04:12 CST 2017
        multi_mmap..................................................Passed.
        Runtime 487 seconds.
      
      Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
      Fixes: 1cce4df0 ("ocfs2: do not lock/unlock() inode DLM lock")
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Eric Ren <zren@suse.com>
      Acked-by: alex chen <alex.chen@huawei.com>
      Acked-by: piaojun <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid · 025bcbde
      Committed by piaojun
      If metadata is corrupted, such as an 'invalid inode block', calling
      'mount()' fails and the filesystem is set read-only as below:
      
        ocfs2_mount
          ocfs2_initialize_super
            ocfs2_init_global_system_inodes
              ocfs2_iget
                ocfs2_read_locked_inode
                  ocfs2_validate_inode_block
                    ocfs2_error
                      ocfs2_handle_error
                        ocfs2_set_ro_flag(osb, 0);  // set readonly
      
      In this situation we need to return -EROFS to 'mount.ocfs2', so that
      the user can fix it with fsck and then mount again.  In addition,
      'mount.ocfs2' should be updated correspondingly, as it only returns 1
      for every errno.  I will post a patch for 'mount.ocfs2' too.
      
      Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clean dead code in suballoc.c · dd7b5f9d
      Committed by Changwei Ge
      Stack variable fe is no longer used, so trim it to save some CPU cycles
      and stack space.
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F5A8DD@H3CMLB14-EX.srv.huawei-3com.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: use the OCFS2_XATTR_ROOT_SIZE macro in ocfs2_reflink_xattr_header() · 32ed0bd7
      Committed by alex chen
      Using the OCFS2_XATTR_ROOT_SIZE macro improves the readability of the
      code.
      
      Link: http://lkml.kernel.org/r/5A2E2488.70301@huawei.com
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/cluster: close a race that fence can't be triggered · fc2af28b
      Committed by Yang Zhang
      When some nodes of a cluster suffer a TCP connection fault, ocfs2
      picks a quorum to continue working, and the other nodes are fenced by
      resetting the host.
      
      In order to decide which node should be fenced, ocfs2 leverages
      o2quo_state::qs_holds.  When that variable drops to zero, an attempt
      is made to decide whether to fence the local node.  However, in the
      specific scenario where the local node is not disconnected from the
      others at the same time, this method fails to reduce ::qs_holds to
      zero, because the o2net 90s idle timers corresponding to different
      nodes are triggered one after another.
      
        node 2                          node 3
        90s idle timer elapses
        clear ::qs_conn_bm
        set hold
                                        40s pass
                                        90s idle timer elapses
                                        clear ::qs_conn_bm
                                        set hold
        still up timer elapses
        clear hold (NOT to zero)
        90s idle timer elapses AGAIN
                                        still up timer elapses
                                        clear hold
                                        still up timer elapses
      
      To solve this issue, a node which has already been evicted from
      ::qs_conn_bm can no longer set hold again and again when invoked from
      the idle timer.
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F3F93B@H3CMLB12-EX.srv.huawei-3com.com
      Signed-off-by: Yang Zhang <zhang.yangB@h3c.com>
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: give an obvious tip for mismatched cluster names · a52370b3
      Committed by Gang He
      Add an obvious error message for mismatched cluster names between the
      on-disk value and the current cluster.  We can meet this case during
      OCFS2 cluster migration.
      
      If we can give the user an obvious tip for why they can not mount the
      file system after migration, they can quickly fix this mismatch problem.
      
      Second, also move printing ocfs2_fill_super() errno to the front of
      ocfs2_dismount_volume(), since ocfs2_dismount_volume() will also print
      its own message.
      
      I looked through all the code of OCFS2 (including o2cb); there is no
      place which returns this error.  In fact, the function calling path
      ocfs2_fill_super -> ocfs2_mount_volume -> ocfs2_dlm_init ->
      dlm_new_lockspace is a very specific one.  We can use this errno to
      give the user a clearer tip: this case is fairly common during cluster
      migration, and the customer can quickly find the failure cause if an
      error is printed.  Also, I think it is not possible to add this errno
      in the o2cb path during ocfs2_dlm_init(), since the o2cb code has been
      stable for a long time.
      
      We only print this error tip when the user uses pcmk stack, since using
      the o2cb stack the user will not meet this error.
      
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1495419305-3780-1-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1495089336-19312-1-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Mark Fasheh <mfasheh@versity.com>
      Acked-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/cluster: neaten a member of o2net_msg_handler · cfdce25c
      Committed by Changwei Ge
      It's odd that o2net_msg_handler::nh_func_data is declared as type
      o2net_msg_handler_func*.  So neaten it.
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F554DA@H3CMLB14-EX.srv.huawei-3com.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/ocfs2/dlm/dlmmaster.c: clean up dead code · e37b963c
      Committed by Changwei Ge
      This code has been commented out for 12 years.  Remove it.
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED7EF9E@H3CMLB14-EX.srv.huawei-3com.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: alex chen <alex.chen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • m32r: remove abort() · d91dad45
      Committed by Sudip Mukherjee
      Commit 7c2c11b2 ("arch: define weak abort()") introduced a weak
      abort() which is common to all architectures, so we no longer need an
      arch-specific abort() with the same code as the weak one.  Remove the
      abort() for m32r.
      
      Link: http://lkml.kernel.org/r/1516912339-5665-1-git-send-email-sudipm.mukherjee@gmail.com
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts/tags.sh: change find_other_sources() for include directories · 99443f81
      Committed by Arend van Spriel
      The find currently done in find_other_sources() excludes directories
      in the kernel tree that are named 'include', e.g.:
      
      	./security/apparmor/include
      	./security/selinux/include
      	./drivers/net/wireless/broadcom/brcm80211/include
      	./drivers/gpu/drm/amd/acp/include
      	./drivers/gpu/drm/amd/display/include
      	./drivers/gpu/drm/amd/include
      	./drivers/gpu/drm/nouveau/include
      
      This changes the find command in find_other_sources() to include those
      using the -path option.
      
      Link: http://lkml.kernel.org/r/1513335768-7852-1-git-send-email-arend.vanspriel@broadcom.com
      Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
      Cc: Robert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      scripts/decodecode: make it take multiline Code line · 7e68b361
      Andy Shevchenko committed
      When scripts/decodecode is run without any parameters and is given a
      Code line copied and pasted from, for example, an email, it parses
      only the first line, although in emails the Code line is often split
      across several lines.
      
      i.e., when you have a file taken from an oops, the Code line looks like
      
        Code: hh hh ... <hh> ... hh\n
      
      When it is copied and pasted from, for example, an email where the
      sender or some intermediate MTA split it, the Code line looks like:
      
        Code: hh hh ... hh\n
        hh ... <hh> ... hh\n
        hh hh ... hh\n
      
      The oops line that follows the Code line usually contains characters
      outside the set of hex digits, spaces, '<' and '>'.
      
      So add logic to join this split back if and only if the following lines
      have hex digits, or spaces, or '<', or '>' characters.  It will be quite
      unlikely to have a broken input in well formed Oops or dmesg, thus a
      simple regex is being used.
      
      Link: http://lkml.kernel.org/r/20171212100323.33201-1-andriy.shevchenko@linux.intel.com
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e68b361
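The joining rule described in the "scripts/decodecode: make it take multiline Code line" message can be sketched in C. This is only an illustration of the rule as the message states it, not the script's actual implementation (decodecode is a shell script, and the message says it uses a simple regex). The function names `is_code_continuation` and `join_code_lines` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if `line` is non-empty and contains only hex digits,
 * spaces, '<' or '>', which is the condition the commit message gives
 * for treating a line as a continuation of the Code line.
 */
int is_code_continuation(const char *line)
{
    if (*line == '\0')
        return 0;
    for (const char *p = line; *p != '\0'; p++) {
        if (!isxdigit((unsigned char)*p) &&
            *p != ' ' && *p != '<' && *p != '>')
            return 0;
    }
    return 1;
}

/*
 * Copies lines[0] (assumed to be the "Code:" line) into `out` and
 * appends every directly following continuation line, separated by a
 * single space. Stops at the first line that is not a continuation.
 * Returns the number of input lines consumed, or 0 if `n` is 0 or
 * `out` is too small.
 */
size_t join_code_lines(const char *const lines[], size_t n,
                       char *out, size_t outsz)
{
    size_t used, i;

    if (n == 0 || strlen(lines[0]) + 1 > outsz)
        return 0;
    strcpy(out, lines[0]);
    used = strlen(out);

    for (i = 1; i < n && is_code_continuation(lines[i]); i++) {
        size_t len = strlen(lines[i]);

        if (used + 1 + len + 1 > outsz)
            return 0;
        out[used++] = ' ';
        memcpy(out + used, lines[i], len + 1); /* includes the NUL */
        used += len;
    }
    return i;
}
```

Given the lines `"Code: 48 89"`, `"e5 <0f> 0b"`, `"48 c7"` and `"RIP: foo+0x1/0x2"`, the sketch consumes the first three and stops at the fourth, because `R`, `I`, `P` and `:` fall outside the allowed set.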
    • J
      fs/dax.c: release PMD lock even when there is no PMD support in DAX · ee190ca6
      Jan H. Schönherr committed
      follow_pte_pmd() can theoretically return after having acquired a PMD
      lock, even when DAX was not compiled with CONFIG_FS_DAX_PMD.
      
      Release the PMD lock unconditionally.
      
      Link: http://lkml.kernel.org/r/20180118133839.20587-1-jschoenh@amazon.de
      Fixes: f729c8c9 ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean")
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee190ca6
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 7b1cd95d
      Linus Torvalds committed
      Pull RDMA subsystem updates from Jason Gunthorpe:
       "Overall this cycle did not have any major excitement, and did not
        require any shared branch with netdev.
      
        Lots of driver updates, particularly of the scale-up and performance
        variety. The largest body of core work was Parav's patches fixing and
        restructuring some of the core code to make way for future RDMA
        containerization.
      
        Summary:
      
         - misc small driver fixups to
           bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes
      
         - several major feature adds to bnxt_re driver: SRIOV VF RoCE
           support, HugePages support, extended hardware stats support, and
           SRQ support
      
         - a notable number of fixes to the i40iw driver from debugging scale
           up testing
      
         - more work to enable the new hip08 chip in the hns driver
      
         - misc small ULP fixups to srp/srpt/ipoib
      
         - preparation for srp initiator and target to support the RDMA-CM
           protocol for connections
      
         - add RDMA-CM support to srp initiator, srp target is still a WIP
      
         - fixes for a couple of places where ipoib could spam the dmesg log
      
         - fix encode/decode of FDR/EDR data rates in the core
      
         - many patches from Parav with ongoing work to clean up
           inconsistencies and bugs in RoCE support around the rdma_cm
      
         - mlx5 driver support for the userspace features 'thread domain',
           'wallclock timestamps' and 'DV Direct Connected transport'. Support
           for the firmware dual port RoCE capability
      
         - core support for more than 32 rdma devices in the char dev
           allocation
      
         - kernel doc updates from Randy Dunlap
      
         - new netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'
      
         - one minor change to the kobject code acked by Greg KH"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (259 commits)
        RDMA/nldev: Provide detailed QP information
        RDMA/nldev: Provide global resource utilization
        RDMA/core: Add resource tracking for create and destroy PDs
        RDMA/core: Add resource tracking for create and destroy CQs
        RDMA/core: Add resource tracking for create and destroy QPs
        RDMA/restrack: Add general infrastructure to track RDMA resources
        RDMA/core: Save kernel caller name when creating PD and CQ objects
        RDMA/core: Use the MODNAME instead of the function name for pd callers
        RDMA: Move enum ib_cq_creation_flags to uapi headers
        IB/rxe: Change RDMA_RXE kconfig to use select
        IB/qib: remove qib_keys.c
        IB/mthca: remove mthca_user.h
        RDMA/cm: Fix access to uninitialized variable
        RDMA/cma: Use existing netif_is_bond_master function
        IB/core: Avoid SGID attributes query while converting GID from OPA to IB
        RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
        IB/umad: Fix use of unprotected device pointer
        IB/iser: Combine substrings for three messages
        IB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()
        IB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()
        ...
      7b1cd95d
    • L
      Merge tag 'dmaengine-4.16-rc1' of git://git.infradead.org/users/vkoul/slave-dma · 2155e69a
      Linus Torvalds committed
      Pull dmaengine updates from Vinod Koul:
       "This time it is a smallish update, with updates mainly to drivers:
      
         - updates to xilinx and zynqmp dma controllers
      
         - update residue calculation for rcar controller
      
         - more RSTify fixes for documentation
      
         - add support for race free transfer termination and updating for
           users for that
      
         - support for new rev of hidma with additional new APIs to get
           device match data in ACPI/OF
      
         - random updates to a bunch of other drivers"
      
      * tag 'dmaengine-4.16-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (47 commits)
        dmaengine: dmatest: fix container_of member in dmatest_callback
        dmaengine: stm32-dmamux: Remove unnecessary platform_get_resource() error check
        dmaengine: sprd: statify 'sprd_dma_prep_dma_memcpy'
        dmaengine: qcom_hidma: simplify DT resource parsing
        dmaengine: xilinx_dma: Free BD consistent memory
        dmaengine: xilinx_dma: Fix warning variable prev set but not used
        dmaengine: xilinx_dma: properly configure the SG mode bit in the driver for cdma
        dmaengine: doc: format struct fields using monospace
        dmaengine: doc: fix bullet list formatting
        dmaengine: ti-dma-crossbar: Fix event mapping for TPCC_EVT_MUX_60_63
        dmaengine: cppi41: Fix channel queues array size check
        dmaengine: imx-sdma: Add MODULE_FIRMWARE
        dmaengine: xilinx_dma: Fix typos
        dmaengine: xilinx_dma: Differentiate probe based on the ip type
        dmaengine: xilinx_dma: fix style issues from checkpatch
        dmaengine: xilinx_dma: Fix kernel doc warnings
        dmaengine: xilinx_dma: Fix race condition in the driver for multiple descriptor scenario
        dmaeninge: xilinx_dma: Fix bug in multiple frame stores scenario in vdma
        dmaengine: xilinx_dma: Check for channel idle state before submitting dma descriptor
        dmaengine: zynqmp_dma: Fix race condition in the probe
        ...
      2155e69a
    • L
      Merge tag 'dma-mapping-4.16' of git://git.infradead.org/users/hch/dma-mapping · 2382dc9a
      Linus Torvalds committed
      Pull dma mapping updates from Christoph Hellwig:
       "Except for a runtime warning fix from Christian this is all about
        consolidation of the generic no-IOMMU code, as well as the glue code
        for swiotlb.
      
        All the code is based on the x86 implementation with hooks to allow
        all architectures that aren't cache coherent to use it.
      
        The x86 conversion itself has been deferred because the x86
        maintainers were a little busy in the last months"
      
      * tag 'dma-mapping-4.16' of git://git.infradead.org/users/hch/dma-mapping: (57 commits)
        MAINTAINERS: add the iommu list for swiotlb and xen-swiotlb
        arm64: use swiotlb_alloc and swiotlb_free
        arm64: replace ZONE_DMA with ZONE_DMA32
        mips: use swiotlb_{alloc,free}
        mips/netlogic: remove swiotlb support
        tile: use generic swiotlb_ops
        tile: replace ZONE_DMA with ZONE_DMA32
        unicore32: use generic swiotlb_ops
        ia64: remove an ifdef around the content of pci-dma.c
        ia64: clean up swiotlb support
        ia64: use generic swiotlb_ops
        ia64: replace ZONE_DMA with ZONE_DMA32
        swiotlb: remove various exports
        swiotlb: refactor coherent buffer allocation
        swiotlb: refactor coherent buffer freeing
        swiotlb: wire up ->dma_supported in swiotlb_dma_ops
        swiotlb: add common swiotlb_map_ops
        swiotlb: rename swiotlb_free to swiotlb_exit
        x86: rename swiotlb_dma_ops
        powerpc: rename swiotlb_dma_ops
        ...
      2382dc9a
    • L
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 28bc6fb9
      Linus Torvalds committed
      Pull SCSI updates from James Bottomley:
       "This is mostly updates of the usual driver suspects: arcmsr,
        scsi_debug, mpt3sas, lpfc, cxlflash, qla2xxx, aacraid, megaraid_sas,
        hisi_sas.
      
        We also have a rework of the libsas hotplug handling to make it more
        robust, a slew of 32 bit time conversions and fixes, and a host of the
        usual minor updates and style changes. The biggest potential for
        regressions is the libsas hotplug changes, but so far they seem stable
        under testing"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (313 commits)
        scsi: qla2xxx: Fix logo flag for qlt_free_session_done()
        scsi: arcmsr: avoid do_gettimeofday
        scsi: core: Add VENDOR_SPECIFIC sense code definitions
        scsi: qedi: Drop cqe response during connection recovery
        scsi: fas216: fix sense buffer initialization
        scsi: ibmvfc: Remove unneeded semicolons
        scsi: hisi_sas: fix a bug in hisi_sas_dev_gone()
        scsi: hisi_sas: directly attached disk LED feature for v2 hw
        scsi: hisi_sas: devicetree: bindings: add LED feature for v2 hw
        scsi: megaraid_sas: NVMe passthrough command support
        scsi: megaraid: use ktime_get_real for firmware time
        scsi: fnic: use 64-bit timestamps
        scsi: qedf: Fix error return code in __qedf_probe()
        scsi: devinfo: fix format of the device list
        scsi: qla2xxx: Update driver version to 10.00.00.05-k
        scsi: qla2xxx: Add XCB counters to debugfs
        scsi: qla2xxx: Fix queue ID for async abort with Multiqueue
        scsi: qla2xxx: Fix warning for code intentation in __qla24xx_handle_gpdb_event()
        scsi: qla2xxx: Fix warning during port_name debug print
        scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
        ...
      28bc6fb9
    • L
      Merge tag 'for-4.16/dm-changes' of... · 0be600a5
      Linus Torvalds committed
      Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper updates from Mike Snitzer:
      
       - DM core fixes to ensure that bio submission follows a depth-first
         tree walk; this is critical to allow forward progress without the
         need to use the bioset's BIOSET_NEED_RESCUER.
      
       - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.
      
       - DM core cleanups and improvements to make bio-based DM more efficient
         (e.g. reduced memory footprint as well as leveraging per-bio-data more).
      
       - Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
         the more direct IO submission path in the block layer; this mode is
         used by DM multipath and also optimizes targets like DM thin-pool
         that stack directly on an NVMe data device.
      
       - DM multipath improvements to factor out legacy SCSI-only (e.g.
         scsi_dh) code paths to allow for more optimized support for NVMe
         multipath.
      
       - A fix for DM multipath path selectors (service-time and queue-length)
         to select paths in a more balanced way; largely academic but doesn't
         hurt.
      
       - Numerous DM raid target fixes and improvements.
      
       - Add a new DM "unstriped" target that enables Intel to work around
         firmware limitations in some NVMe drives that are striped internally
         (this target also works when stacked above the DM "striped" target).
      
       - Various Documentation fixes and improvements.
      
       - Misc cleanups and fixes across various DM infrastructure and targets
         (e.g. bufio, flakey, log-writes, snapshot).
      
      * tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
        dm cache: Documentation: update default migration_throttling value
        dm mpath selector: more evenly distribute ties
        dm unstripe: fix target length versus number of stripes size check
        dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
        dm table: fix NVMe bio-based dm_table_determine_type() validation
        dm: various cleanups to md->queue initialization code
        dm mpath: delay the retry of a request if the target responded as busy
        dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
        dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
        dm log writes: fix max length used for kstrndup
        dm: backfill missing calls to mutex_destroy()
        dm snapshot: use mutex instead of rw_semaphore
        dm flakey: check for null arg_name in parse_features()
        dm thin: extend thinpool status format string with omitted fields
        dm thin: fixes in thin-provisioning.txt
        dm thin: document representation of <highest mapped sector> when there is none
        dm thin: fix documentation relative to low water mark threshold
        dm cache: be consistent in specifying sectors and SI units in cache.txt
        dm cache: delete obsoleted paragraph in cache.txt
        dm cache: fix grammar in cache-policies.txt
        ...
      0be600a5
    • L
      Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md · 040639b7
      Linus Torvalds committed
      Pull MD updates from Shaohua Li:
       "Some small fixes for MD:
      
         - fix raid5-cache potential problems if raid5 cache isn't fully
           recovered
      
         - fix a wait-within-wait warning in raid1/10
      
         - make raid5-PPL support disks with writeback cache enabled"
      
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
        raid5-ppl: PPL support for disks with write-back cache enabled
        md/r5cache: print more info of log recovery
        md/raid1,raid10: silence warning about wait-within-wait
        md: introduce new personality funciton start()
      040639b7
    • L
      Merge tag 'xfs-4.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 20c59c71
      Linus Torvalds committed
      Pull xfs updates from Darrick Wong:
       "This merge cycle, we again have some substantive changes to XFS.
      
        Metadata verifiers have been restructured to provide more detail about
        which part of a metadata structure failed checks, and we've enhanced
        the new online fsck feature to cross-reference extent allocation
        information with the other metadata structures. With this pull, the
        metadata verification part of online fsck is more or less finished,
        though the feature is still experimental and still disabled by
        default.
      
        We're also preparing to remove the EXPERIMENTAL tag from a couple of
        features this cycle. This week we're committing a bunch of space
        accounting fixes for reflink and removing the EXPERIMENTAL tag from
        reflink; I anticipate that we'll be ready to do the same for the
        reverse mapping feature next week. (I don't have any pending fixes for
        rmap; however I wish to remove the tags one at a time.)
      
        This giant pile of patches has been run through a full xfstests run
        over the weekend and through a quick xfstests run against this
        morning's master, with no major failures reported. Let me know if
        there are any merge problems -- git merge reported that one of our
        patches touched the same function as the i_version series, but it
        resolved things cleanly.
      
        Summary:
      
         - Log faulting code locations when verifiers fail, for improved
           diagnosis of corrupt filesystems.
      
         - Implement metadata verifiers for local format inode fork data.
      
         - Online scrub now cross-references metadata records with other
           metadata.
      
         - Refactor the fs geometry ioctl generation functions.
      
         - Harden various metadata verifiers.
      
         - Fix various accounting problems.
      
         - Fix uncancelled transactions leaking when xattr functions fail.
      
         - Prevent the copy-on-write speculative preallocation garbage
           collector from racing with writeback.
      
         - Emit log reservation type information as trace data so that we can
           compare against xfsprogs.
      
         - Fix some erroneous asserts in the online scrub code.
      
         - Clean up the transaction reservation calculations.
      
         - Fix various minor bugs in online scrub.
      
         - Log complaints about mixed dio/buffered writes once per day and
           less noisily than before.
      
         - Refactor buffer log item lists to use list_head.
      
         - Break PNFS leases before reflinking blocks.
      
         - Reduce lock contention on reflink source files.
      
         - Fix some quota accounting problems with reflink.
      
         - Fix a serious corruption problem in the direct cow write code where
           we fed bad iomaps to the vfs iomap consumers.
      
         - Various other refactorings.
      
         - Remove EXPERIMENTAL tag from reflink!"
      
      * tag 'xfs-4.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (94 commits)
        xfs: remove experimental tag for reflinks
        xfs: don't screw up direct writes when freesp is fragmented
        xfs: check reflink allocation mappings
        iomap: warn on zero-length mappings
        xfs: treat CoW fork operations as delalloc for quota accounting
        xfs: only grab shared inode locks for source file during reflink
        xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes
        xfs: reflink should break pnfs leases before sharing blocks
        xfs: don't clobber inobt/finobt cursors when xref with rmap
        xfs: skip CoW writes past EOF when writeback races with truncate
        xfs: preserve i_rdev when recycling a reclaimable inode
        xfs: refactor accounting updates out of xfs_bmap_btalloc
        xfs: refactor inode verifier corruption error printing
        xfs: make tracepoint inode number format consistent
        xfs: always zero di_flags2 when we free the inode
        xfs: call xfs_qm_dqattach before performing reflink operations
        xfs: bmap code cleanup
        Use list_head infra-structure for buffer's log items list
        Split buffer's b_fspriv field
        Get rid of xfs_buf_log_item_t typedef
        ...
      20c59c71
    • L
      Merge branch 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 5a87e37e
      Linus Torvalds committed
      Pull get_user_pages_fast updates from Al Viro:
       "A bit more get_user_pages work"
      
      * 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        kvm: switch get_user_page_nowait() to get_user_pages_unlocked()
        __get_user_pages_locked(): get rid of notify_drop argument
        get_user_pages_unlocked(): pass true to __get_user_pages_locked() notify_drop
        cris: switch to get_user_pages_fast()
        fold __get_user_pages_unlocked() into its sole remaining caller
      5a87e37e
    • L
      Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 19e7b5f9
      Linus Torvalds committed
      Pull misc vfs updates from Al Viro:
       "All kinds of misc stuff, without any unifying topic, from various
        people.
      
        Neil's d_anon patch, several bugfixes, introduction of kvmalloc
        analogue of kmemdup_user(), extending bitfield.h to deal with
        fixed-endians, assorted cleanups all over the place..."
      
      * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
        alpha: osf_sys.c: use timespec64 where appropriate
        alpha: osf_sys.c: fix put_tv32 regression
        jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
        dcache: delete unused d_hash_mask
        dcache: subtract d_hash_shift from 32 in advance
        fs/buffer.c: fold init_buffer() into init_page_buffers()
        fs: fold __inode_permission() into inode_permission()
        fs: add RWF_APPEND
        sctp: use vmemdup_user() rather than badly open-coding memdup_user()
        snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
        replace_user_tlv(): switch to vmemdup_user()
        new primitive: vmemdup_user()
        memdup_user(): switch to GFP_USER
        eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
        eventfd: fold eventfd_ctx_read() into eventfd_read()
        eventfd: convert to use anon_inode_getfd()
        nfs4file: get rid of pointless include of btrfs.h
        uvc_v4l2: clean copyin/copyout up
        vme_user: don't use __copy_..._user()
        usx2y: don't bother with memdup_user() for 16-byte structure
        ...
      19e7b5f9
    • L
      Merge tag 'gfs2-4.16.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · 26064ea4
      Linus Torvalds committed
      Pull GFS2 updates from Bob Peterson:
       "We've got 30 patches for this merge window. These generally fall into
        five categories:
      
         - code cleanups
      
         - patches related to adding PUNCH_HOLE support to GFS2
      
         - support for new fields in resource group headers
      
         - a few bug fixes
      
         - support for new fields in journal log headers. These new fields,
           which were previously unused, are designed to make it easier to
           track down file system corruption, and allow fsck.gfs2 to make more
           intelligent decisions when finding and fixing file system
           corruption.
      
        Details:
      
         - Two patches from Abhi Das, to trim the ordered writes list, which
           used to grow uncontrollably until unmount.
      
         - Several patches from Andreas Gruenbacher: remove an unused
           parameter from function gfs2_write_jdata_pagevec, remove a
           pointless BUG_ON, clean up an error path in trunc_start, remove
           some unused parameters from truncate, make gfs2_journaled_truncate
           more efficient, clean up the support functions for truncate, fix
           metadata read-ahead for truncate to make it faster, fix up the
           non-recursive truncate code, rework and rename
           gfs2_block_truncate_page, generalize the non-recursive truncate
           code so it can take a range of values for punch_hole support,
           introduce new PUNCH_HOLE support that takes advantage of the
           previous patches, add fallocate support with PUNCH_HOLE, fix some
           typos in the comments, add the function gfs2_max_stuffed_size to
           replace a piece of code that was needlessly repeated throughout
           GFS2, a minor cleanup to function gfs2_page_add_databufs, get rid
           of function gfs2_log_header_in in preparation for the new log
           header fields, and also fix up some missing newlines in kernel
           messages.
      
         - Andy Price added a new field to resource groups to indicate where
           the next one should be, to allow fsck.gfs2 to make better repairs.
           He also added new rindex fields for consistency checking, and added
           a crc field to resource group headers for consistency checking.
      
         - I reduced redundancy in functions common to freeing dinodes, and
           when writing log headers between the journalling code and journal
           recovery code. Also added new fields to journal log headers based
           on a prototype from Steve Whitehouse, and log the source of journal
           log headers so we can better track down journal corruption. Minor
           comment typo fix and a fix for a BUG in an unlink error path.
      
         - Steve Whitehouse contributed a patch to fix an incorrect use of the
           gfs2_blk2rgrpd function.
      
         - Tetsuo Handa contributed a patch that fixes incorrect error
           handling in function init_gfs2_fs"
      
      * tag 'gfs2-4.16.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (30 commits)
        gfs2: Add a few missing newlines in messages
        gfs2: Remove inode from ordered write list in gfs2_write_inode()
        GFS2: Don't try to end a non-existent transaction in unlink
        GFS2: Fix minor comment typo
        GFS2: Log the reason for log flushes in every log header
        GFS2: Introduce new gfs2_log_header_v2
        gfs2: Get rid of gfs2_log_header_in
        gfs2: Minor gfs2_page_add_databufs cleanup
        gfs2: Add gfs2_max_stuffed_size
        gfs2: Typo fixes
        gfs2: Implement fallocate(FALLOC_FL_PUNCH_HOLE)
        gfs2: Turn trunc_dealloc into punch_hole
        gfs2: Generalize truncate code
        Turn gfs2_block_truncate_page into gfs2_block_zero_range
        gfs2: Improve non-recursive delete algorithm
        gfs2: Fix metadata read-ahead during truncate
        gfs2: Clean up {lookup,fillup}_metapath
        gfs2: Remove minor gfs2_journaled_truncate inefficiencies
        gfs2: truncate: Remove unnecessary oldsize parameters
        gfs2: Clean up trunc_start error path
        ...
      26064ea4
    • E
      devpts: fix error handling in devpts_mntget() · c9cc8d01
      Eric Biggers committed
      If devpts_ptmx_path() returns an error code, then devpts_mntget()
      dereferences an ERR_PTR():
      
          BUG: unable to handle kernel paging request at fffffffffffffff5
          IP: devpts_mntget+0x13f/0x280 fs/devpts/inode.c:173
      
      Fix it by returning early in the error paths.
      
      Reproducer:
      
          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <sched.h>
          #include <sys/ioctl.h>
          #define TIOCGPTPEER _IO('T', 0x41)
      
          int main()
          {
              for (;;) {
                  int fd = open("/dev/ptmx", 0);
                  unshare(CLONE_NEWNS);
                  ioctl(fd, TIOCGPTPEER, 0);
              }
          }
      
      Fixes: 311fc65c ("pty: Repair TIOCGPTPEER")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: <stable@vger.kernel.org> # v4.13+
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9cc8d01
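The class of bug described in "devpts: fix error handling in devpts_mntget()" can be illustrated with a userspace imitation of the kernel's error-pointer convention, in which a small negative errno is encoded in a pointer value. This is a sketch under stated assumptions, not the kernel's code: `err_ptr`, `ptr_err` and `is_err` only mimic `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()`, and `struct mount_stub`, `resolve_mount` and `get_mount` are hypothetical stand-ins.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR(), PTR_ERR(), IS_ERR(). */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

int is_err(const void *ptr)
{
    /* Error pointers occupy the top MAX_ERRNO values of the address range. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical object standing in for a mount. */
struct mount_stub {
    int id;
};

/* Hypothetical helper: fills *out on success, returns -errno on failure. */
int resolve_mount(int fail, struct mount_stub **out)
{
    static struct mount_stub good = { 7 };

    if (fail)
        return -ENOENT;
    *out = &good;
    return 0;
}

/*
 * Returns a valid pointer or an encoded error. The early return is the
 * point of the sketch: once the helper has failed, nothing below may
 * dereference the result.
 */
struct mount_stub *get_mount(int fail)
{
    struct mount_stub *m = NULL;
    int err = resolve_mount(fail, &m);

    if (err)
        return err_ptr(err);    /* return early on the error path */
    if (m->id <= 0)             /* safe: m is known to be valid here */
        return err_ptr(-EINVAL);
    return m;
}
```

A caller checks the result with `is_err()` before touching it and recovers the error code with `ptr_err()`. Skipping that check and dereferencing the encoded value is what produces a fault at an address near the top of the address space.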
    • J
      iversion: make inode_cmp_iversion{+raw} return bool instead of s64 · c0cef30e
      Jeff Layton committed
      As Linus points out:
      
          The inode_cmp_iversion{+raw}() functions are pure and utter crap.
      
          Why?
      
          You say that they return 0/negative/positive, but they do so in a
          completely broken manner. They return that ternary value as the
          sequence number difference in a 's64', which means that if you
          actually care about that ternary value, and do the *sane* thing that
          the kernel-doc of the function implies is the right thing, you would
          do
      
              int cmp = inode_cmp_iversion(inode, old);
              if (cmp < 0 ...
      
          and as a result you get code that looks sane, but that doesn't
          actually *WORK* right.
      
      Since none of the callers actually care about the ternary value here,
      convert the inode_cmp_iversion{+raw} functions to just return a boolean
      value (false for matching, true for non-matching).
      
      This matches the existing use of these functions just fine, and makes it
      simple to convert them to return a ternary value in the future if we
      grow callers that need it.
      
      With this change we can also reimplement inode_cmp_iversion in a simpler
      way using inode_peek_iversion.
      
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0cef30e
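The problem quoted in "iversion: make inode_cmp_iversion{+raw} return bool instead of s64" can be reproduced in a few lines of C. The functions below are simplified, hypothetical stand-ins (`cmp_s64`, `cmp_truncated`, `cmp_bool`), not the kernel's inode_cmp_iversion(). The sketch also assumes a compiler such as GCC or Clang, where converting an out-of-range 64-bit value to `int` keeps the low 32 bits; the C standard leaves that conversion implementation-defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old-style comparison: returns the raw 64-bit difference. */
int64_t cmp_s64(uint64_t cur, uint64_t old)
{
    return (int64_t)(cur - old);
}

/*
 * What a caller sees after writing `int cmp = cmp_s64(...)`: the
 * 64-bit difference is narrowed to int, so its sign (or even its
 * non-zero-ness) can be lost.
 */
int cmp_truncated(uint64_t cur, uint64_t old)
{
    return (int)cmp_s64(cur, old);
}

/* New-style comparison: false for matching, true for non-matching. */
bool cmp_bool(uint64_t cur, uint64_t old)
{
    return cur != old;
}
```

Under the stated assumption, a difference of 2^32 narrows to 0 and so looks like "equal", and a difference of 2^31 narrows to a negative value although the real difference is positive. The boolean form has no such failure mode, because it never exposes the difference.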
  2. 31 Jan, 2018 4 commits