1. 16 6月, 2020 3 次提交
    • Q
      jbd2: fix data races at struct journal_head · 8d6725cc
      Qian Cai 提交于
      task #28557737
      
      [ Upstream commit 6c5d911249290f41f7b50b43344a7520605b1acb ]
      
      journal_head::b_transaction and journal_head::b_next_transaction could
      be accessed concurrently as noticed by KCSAN,
      
       LTP: starting fsync04
       /dev/zero: Can't open blockdev
       EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
       EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
       ==================================================================
       BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]
      
       write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
        __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
        __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
        jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
        (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
        kjournald2+0x13b/0x450 [jbd2]
        kthread+0x1cd/0x1f0
        ret_from_fork+0x27/0x50
      
       read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
        jbd2_write_access_granted+0x1b2/0x250 [jbd2]
        jbd2_write_access_granted at fs/jbd2/transaction.c:1155
        jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
        __ext4_journal_get_write_access+0x50/0x90 [ext4]
        ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
        ext4_mb_new_blocks+0x54f/0xca0 [ext4]
        ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
        ext4_map_blocks+0x3b4/0x950 [ext4]
        _ext4_get_block+0xfc/0x270 [ext4]
        ext4_get_block+0x3b/0x50 [ext4]
        __block_write_begin_int+0x22e/0xae0
        __block_write_begin+0x39/0x50
        ext4_write_begin+0x388/0xb50 [ext4]
        generic_perform_write+0x15d/0x290
        ext4_buffered_write_iter+0x11f/0x210 [ext4]
        ext4_file_write_iter+0xce/0x9e0 [ext4]
        new_sync_write+0x29c/0x3b0
        __vfs_write+0x92/0xa0
        vfs_write+0x103/0x260
        ksys_write+0x9d/0x130
        __x64_sys_write+0x4c/0x60
        do_syscall_64+0x91/0xb05
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       5 locks held by fsync04/25724:
        #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
        #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
        #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
        #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
        #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
       irq event stamp: 1407125
       hardirqs last  enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
       hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
       softirqs last  enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
       softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      The plain reads are outside of jh->b_state_lock critical section which result
      in data races. Fix them by adding pairs of READ|WRITE_ONCE().
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pwSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8d6725cc
    • W
      jbd2: fix ocfs2 corrupt when clearing block group bits · cfc67fd4
      wangyan 提交于
      task #28557737
      
      commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6 upstream.
      
      I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
      The running environment:
      	kernel version: 4.19
      	A cluster with two nodes, 5 luns mounted on two nodes, and do some
      	file operations like dd/fallocate/truncate/rm on every lun with storage
      	network disconnection.
      
      The fallocate operation on dm-23-45 caused an null pointer dereference.
      
      The information of NULL pointer dereference as follows:
      	[577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
      	[577992.878290] Aborting journal on device dm-23-45.
      	...
      	[577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
      	[577992.890908] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
      	[577992.890918] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
      	[577992.890922] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
      	[577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
      	[577992.890928] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
      	[577992.890933] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890939] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
      	[577992.890950] Mem abort info:
      	[577992.890951]   ESR = 0x96000004
      	[577992.890952]   Exception class = DABT (current EL), IL = 32 bits
      	[577992.890952]   SET = 0, FnV = 0
      	[577992.890953]   EA = 0, S1PTW = 0
      	[577992.890954] Data abort info:
      	[577992.890955]   ISV = 0, ISS = 0x00000004
      	[577992.890956]   CM = 0, WnR = 0
      	[577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
      	[577992.890960] [0000000000000020] pgd=0000000000000000
      	[577992.890964] Internal error: Oops: 96000004 [#1] SMP
      	[577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
      	[577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G        W  OE     4.19.36 #1
      	[577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
      	[577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
      	[577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
      	[577992.891084] sp : ffff0000c8e2b810
      	[577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
      	[577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
      	[577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
      	[577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
      	[577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
      	[577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
      	[577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
      	[577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
      	[577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
      	[577992.891103] x11: 0000000000000038 x10: 0101010101010101
      	[577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
      	[577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
      	[577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
      	[577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
      	[577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
      	[577992.891116] Call trace:
      	[577992.891139]  _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891162]  _ocfs2_free_clusters+0x100/0x290 [ocfs2]
      	[577992.891185]  ocfs2_free_clusters+0x50/0x68 [ocfs2]
      	[577992.891206]  ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
      	[577992.891227]  ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
      	[577992.891248]  ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
      	[577992.891269]  ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
      	[577992.891290]  __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
      	[577992.891309]  ocfs2_fallocate+0xe4/0x128 [ocfs2]
      	[577992.891316]  vfs_fallocate+0x11c/0x250
      	[577992.891317]  ksys_fallocate+0x54/0x88
      	[577992.891319]  __arm64_sys_fallocate+0x28/0x38
      	[577992.891323]  el0_svc_common+0x78/0x130
      	[577992.891325]  el0_svc_handler+0x38/0x78
      	[577992.891327]  el0_svc+0x8/0xc
      
      My analysis process as follows:
      ocfs2_fallocate
        __ocfs2_change_file_space
          ocfs2_allocate_extents
            ocfs2_extend_allocation
              ocfs2_add_inode_data
                ocfs2_add_clusters_in_btree
                  ocfs2_insert_extent
                    ocfs2_do_insert_extent
                      ocfs2_rotate_tree_right
                        ocfs2_extend_rotate_transaction
                          ocfs2_extend_trans
                            jbd2_journal_restart
                              jbd2__journal_restart
                                /* handle->h_transaction is NULL,
                                 * is_handle_aborted(handle) is true
                                 */
                                handle->h_transaction = NULL;
                                start_this_handle
                                  return -EROFS;
                  ocfs2_free_clusters
                    _ocfs2_free_clusters
                      _ocfs2_free_suballoc_bits
                        ocfs2_block_group_clear_bits
                          ocfs2_journal_access_gd
                            __ocfs2_journal_access
                              jbd2_journal_get_undo_access
                                /* I think jbd2_write_access_granted() will
                                 * return true, because do_get_write_access()
                                 * will return -EROFS.
                                 */
                                if (jbd2_write_access_granted(...)) return 0;
                                do_get_write_access
                                  /* handle->h_transaction is NULL, it will
                                   * return -EROFS here, so do_get_write_access()
                                   * was not called.
                                   */
                                  if (is_handle_aborted(handle)) return -EROFS;
                          /* bh2jh(group_bh) is NULL, caused NULL
                             pointer dereference */
                          undo_bg = (struct ocfs2_group_desc *)
                                      bh2jh(group_bh)->b_committed_data;
      
      If handle->h_transaction == NULL, then jbd2_write_access_granted()
      does not really guarantee that journal_head will stay around,
      not even speaking of its b_committed_data. The bh2jh(group_bh)
      can be removed after ocfs2_journal_access_gd() and before call
      "bh2jh(group_bh)->b_committed_data". So, we should move
      is_handle_aborted() check from do_get_write_access() into
      jbd2_journal_get_undo_access() and jbd2_journal_get_write_access()
      before the call to jbd2_write_access_granted().
      
      Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.comSigned-off-by: NYan Wang <wangyan122@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJun Piao <piaojun@huawei.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      cfc67fd4
    • Z
      jbd2: move the clearing of b_modified flag to the journal_unmap_buffer() · cdf29ec3
      zhangyi (F) 提交于
      task #28557737
      
      [ Upstream commit 6a66a7ded12baa6ebbb2e3e82f8cb91382814839 ]
      
      There is no need to delay the clearing of b_modified flag to the
      transaction committing time when unmapping the journalled buffer, so
      just move it to the journal_unmap_buffer().
      
      Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.comReviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      cdf29ec3
  2. 18 3月, 2020 4 次提交
    • X
      alinux: jbd2: fix build errors · e7cec598
      Xiaoguang Wang 提交于
      Fix below build errors, 'wait_sum' and 'iowait_sum' need
      CONFIG_SCHEDSTATS to be configured.
      
      fs/jbd2/transaction.c: In function 'new_handle':
      fs/jbd2/transaction.c:406:51: error: 'struct sched_statistics' has no member named 'wait_sum'
          handle->h_sched_wait_sum = current->se.statistics.wait_sum;
      fs/jbd2/transaction.c:407:48: error: 'struct sched_statistics' has no member named 'iowait_sum'
          handle->h_io_wait_sum = current->se.statistics.iowait_sum;
      
      fs/jbd2/transaction.c: In function 'jbd2_journal_stop':
      fs/jbd2/transaction.c:1790:38: error: 'struct sched_statistics' has no member named 'wait_sum'
          sched_wait = current->se.statistics.wait_sum -
      fs/jbd2/transaction.c:1792:35: error: 'struct sched_statistics' has no member named 'iowait_sum'
          io_wait = current->se.statistics.iowait_sum -
      
      Fixes: 67393bb9f538 ("alinux: jbd2: track slow handle which is preventing transaction committing")
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e7cec598
    • J
      alinux: jbd2: fix build warnings · 54f827ce
      Joseph Qi 提交于
      Fix the following build warnings:
        fs/jbd2/transaction.o: In function `jbd2_journal_stop':
        (.text+0x2934): undefined reference to `__udivdi3'
        (.text+0x2970): undefined reference to `__udivdi3'
      
      Fixes: 861575c9 ("alinux: jbd2: track slow handle which is preventing transaction committing")
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      54f827ce
    • X
      alinux: jbd2: track slow handle which is preventing transaction committing · 83cd9d23
      Xiaoguang Wang 提交于
      While transaction is going to commit, it first sets its state to be
      T_LOCKED and waits all outstanding handles to complete, and the
      committing transaction will always be in locked state so long as it
      has outstanding handles, also the whole fs will be locked and all later
      fs modification operations will be stucked in wait_transaction_locked().
      
      It's hard to tell why handles are that slow, so here we add a new staic
      tracepoint to track such slow handle, and show io wait time and sched
      wait time, output likes below:
        fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17
      tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24
      dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126
      
      "trans_wait 122" means that this current committing transaction has been
      locked for 122ms, due to this handle is not completed quickly.
      
      From "io_wait 126", we can see that io is the major reason.
      
      In this patch, we also add a per fs control file used to determine
      whether a handle can be considered to be slow.
          /proc/fs/jbd2/vdb1-8/stall_thresh
      default value is 100ms, users can set new threshold by echoing new value
      to this file.
      
      Later I also plan to add a proc file fs per fs to record these info.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      83cd9d23
    • X
      alinux: fs: record page or bio info while process is waitting on it · 2864376f
      Xiaoguang Wang 提交于
      If one process context is stucked in wait_on_buffer(), lock_buffer(),
      lock_page() and wait_on_page_writeback() and wait_on_bit_io(), it's
      hard to tell ture reason, for example, whether this page is under io,
      or this page is just locked too long by other process context.
      
      Normally io request has multiple bios, and every bio contains multiple
      pages which will hold data to be read from or written to device, so here
      we record page info or bio info in task_struct while process calls
      lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
      and wait_on_bit_io(), we add a new proce interface:
      [lege@localhost linux]$ cat /proc/4516/wait_res
      1 ffffd0969f95d3c0 4295369599 4295381596
      
      Above info means that thread 4516 is waitting on a page, address is
      ffffd0969f95d3c0, and has waited for 11997ms.
      
      First field denotes the page address process is waitting on.
      Second field denotes the wait moment and the third denotes current moment.
      
      In practice, if we found a process waitting on one page for too long time,
      we can get page's address by reading /proc/$pid/wait_page, and search this
      page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang,
      if search operation hits one, we can get the request and know why this io
      request hangs that long.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2864376f
  3. 17 1月, 2020 1 次提交
    • Z
      jbd2: discard dirty data when forgetting an un-journalled buffer · 3aeefb45
      zhangyi (F) 提交于
      commit 597599268e3b91cac71faf48743f4783dec682fc upstream.
      
      We do not unmap and clear dirty flag when forgetting a buffer without
      journal or does not belongs to any transaction, so the invalid dirty
      data may still be written to the disk later. It's fine if the
      corresponding block is never used before the next mount, and it's also
      fine that we invoke clean_bdev_aliases() related functions to unmap
      the block device mapping when re-allocating such freed block as data
      block. But this logic is somewhat fragile and risky that may lead to
      data corruption if we forget to clean bdev aliases. So, It's better to
      discard dirty data during forget time.
      
      We have been already handled all the cases of forgetting journalled
      buffer, this patch deal with the remaining two cases.
      
      - buffer is not journalled yet,
      - buffer is journalled but doesn't belongs to any transaction.
      
      We invoke __bforget() instead of __brelese() when forgetting an
      un-journalled buffer in jbd2_journal_forget(). After this patch we can
      remove all clean_bdev_aliases() related calls in ext4.
      Suggested-by: NJan Kara <jack@suse.cz>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3aeefb45
  4. 28 7月, 2019 1 次提交
    • R
      jbd2: introduce jbd2_inode dirty range scoping · af3812b6
      Ross Zwisler 提交于
      commit 6ba0e7dc64a5adcda2fbe65adc466891795d639e upstream.
      
      Currently both journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() operate on the entire address space
      of each of the inodes associated with a given journal entry.  The
      consequence of this is that if we have an inode where we are constantly
      appending dirty pages we can end up waiting for an indefinite amount of
      time in journal_finish_inode_data_buffers() while we wait for all the
      pages under writeback to be written out.
      
      The easiest way to cause this type of workload is do just dd from
      /dev/zero to a file until it fills the entire filesystem.  This can
      cause journal_finish_inode_data_buffers() to wait for the duration of
      the entire dd operation.
      
      We can improve this situation by scoping each of the inode dirty ranges
      associated with a given transaction.  We do this via the jbd2_inode
      structure so that the scoping is contained within jbd2 and so that it
      follows the lifetime and locking rules for that structure.
      
      This allows us to limit the writeback & wait in
      journal_submit_inode_data_buffers() and
      journal_finish_inode_data_buffers() respectively to the dirty range for
      a given struct jdb2_inode, keeping us from waiting forever if the inode
      in question is still being appended to.
      Signed-off-by: NRoss Zwisler <zwisler@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      af3812b6
  5. 22 5月, 2019 1 次提交
  6. 24 3月, 2019 2 次提交
    • Z
      jbd2: fix compile warning when using JBUFFER_TRACE · 584f390d
      zhangyi (F) 提交于
      commit 01215d3edb0f384ddeaa5e4a22c1ae5ff634149f upstream.
      
      The jh pointer may be used uninitialized in the two cases below and the
      compiler complain about it when enabling JBUFFER_TRACE macro, fix them.
      
      In file included from fs/jbd2/transaction.c:19:0:
      fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
      ./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
       #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                            ^
      fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
        struct journal_head *jh;
                             ^
      In file included from fs/jbd2/transaction.c:19:0:
      fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
      ./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
       #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                            ^
      fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
        struct journal_head *jh;
                             ^
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      584f390d
    • Z
      jbd2: clear dirty flag when revoking a buffer from an older transaction · dbe4bc99
      zhangyi (F) 提交于
      commit 904cdbd41d749a476863a0ca41f6f396774f26e4 upstream.
      
      Now, we capture a data corruption problem on ext4 while we're truncating
      an extent index block. Imaging that if we are revoking a buffer which
      has been journaled by the committing transaction, the buffer's jbddirty
      flag will not be cleared in jbd2_journal_forget(), so the commit code
      will set the buffer dirty flag again after refile the buffer.
      
      fsx                               kjournald2
                                        jbd2_journal_commit_transaction
      jbd2_journal_revoke                commit phase 1~5...
       jbd2_journal_forget
         belongs to older transaction    commit phase 6
         jbddirty not clear               __jbd2_journal_refile_buffer
                                           __jbd2_journal_unfile_buffer
                                            test_clear_buffer_jbddirty
                                             mark_buffer_dirty
      
      Finally, if the freed extent index block was allocated again as data
      block by some other files, it may corrupt the file data after writing
      cached pages later, such as during unmount time. (In general,
      clean_bdev_aliases() related helpers should be invoked after
      re-allocation to prevent the above corruption, but unfortunately we
      missed it when zeroout the head of extra extent blocks in
      ext4_ext_handle_unwritten_extents()).
      
      This patch mark buffer as freed and set j_next_transaction to the new
      transaction when it already belongs to the committing transaction in
      jbd2_journal_forget(), so that commit code knows it should clear dirty
      bits when it is done with the buffer.
      
      This problem can be reproduced by xfstests generic/455 easily with
      seeds (3246 3247 3248 3249).
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dbe4bc99
  7. 17 6月, 2018 1 次提交
  8. 21 5月, 2018 1 次提交
  9. 18 4月, 2018 1 次提交
    • T
      ext4: set h_journal if there is a failure starting a reserved handle · b2569260
      Theodore Ts'o 提交于
      If ext4 tries to start a reserved handle via
      jbd2_journal_start_reserved(), and the journal has been aborted, this
      can result in a NULL pointer dereference.  This is because the fields
      h_journal and h_transaction in the handle structure share the same
      memory, via a union, so jbd2_journal_start_reserved() will clear
      h_journal before calling start_this_handle().  If this function fails
      due to an aborted handle, h_journal will still be NULL, and the call
      to jbd2_journal_free_reserved() will pass a NULL journal to
      sub_reserve_credits().
      
      This can be reproduced by running "kvm-xfstests -c dioread_nolock
      generic/475".
      
      Cc: stable@kernel.org # 3.11
      Fixes: 8f7d89f3 ("jbd2: transaction reservation support")
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Reviewed-by: NJan Kara <jack@suse.cz>
      b2569260
  10. 10 1月, 2018 1 次提交
    • T
      jbd2: fix sphinx kernel-doc build warnings · f69120ce
      Tobin C. Harding 提交于
      Sphinx emits various (26) warnings when building make target 'htmldocs'.
      Currently struct definitions contain duplicate documentation, some as
      kernel-docs and some as standard c89 comments.  We can reduce
      duplication while cleaning up the kernel docs.
      
      Move all kernel-docs to right above each struct member.  Use the set of
      all existing comments (kernel-doc and c89).  Add documentation for
      missing struct members and function arguments.
      Signed-off-by: NTobin C. Harding <me@tobin.cc>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      f69120ce
  11. 18 12月, 2017 1 次提交
    • T
      ext4: fix up remaining files with SPDX cleanups · f5166768
      Theodore Ts'o 提交于
      A number of ext4 source files were skipped due because their copyright
      permission statements didn't match the expected text used by the
      automated conversion utilities.  I've added SPDX tags for the rest.
      
      While looking at some of these files, I've noticed that we have quite
      a bit of variation on the licenses that were used --- in particular
      some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
      and we have some files that have a LGPL-2.1 license (which was quite
      surprising).
      
      I've not attempted to do any license changes.  Even if it is perfectly
      legal to relicense to GPL 2.0-only for consistency's sake, that should
      be done with ext4 developer community discussion.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      
      f5166768
  12. 22 5月, 2017 1 次提交
    • T
      jbd2: preserve original nofs flag during journal restart · b4709067
      Tahsin Erdogan 提交于
      When a transaction starts, start_this_handle() saves current
      PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
      Journal restart is a special case that calls start_this_handle() without
      stopping the transaction. start_this_handle() isn't aware that the
      original value is already stored so it overwrites it with current value.
      
      For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
      at the end:
      
        jbd2_journal_start()
        jbd2__journal_restart()
        jbd2_journal_stop()
      
      Make jbd2__journal_restart() restore the original value before calling
      start_this_handle().
      
      Fixes: 81378da6 ("jbd2: mark the transaction context with the scope GFP_NOFS context")
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      b4709067
  13. 16 5月, 2017 2 次提交
  14. 04 5月, 2017 1 次提交
  15. 05 2月, 2017 1 次提交
  16. 13 10月, 2016 1 次提交
  17. 22 9月, 2016 1 次提交
  18. 30 6月, 2016 3 次提交
    • J
      jbd2: track more dependencies on transaction commit · 1eaa566d
      Jan Kara 提交于
      So far we were tracking only dependency on transaction commit due to
      starting a new handle (which may require commit to start a new
      transaction). Now add tracking also for other cases where we wait for
      transaction commit. This way lockdep can catch deadlocks e. g. because we
      call jbd2_journal_stop() for a synchronous handle with some locks held
      which rank below transaction start.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1eaa566d
    • J
      jbd2: move lockdep tracking to journal_s · ab714aff
      Jan Kara 提交于
      Currently lockdep map is tracked in each journal handle. To be able to
      expand lockdep support to cover also other cases where we depend on
      transaction commit and where handle is not available, move lockdep map
      into struct journal_s. Since this makes the lockdep map shared for all
      handles, we have to use rwsem_acquire_read() for acquisitions now.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ab714aff
    • J
      jbd2: move lockdep instrumentation for jbd2 handles · 7a4b188f
      Jan Kara 提交于
      The transaction the handle references is free to commit once we've
      decremented t_updates counter. Move the lockdep instrumentation to that
      place. Currently it was a bit later which did not really matter but
      subsequent improvements to lockdep instrumentation would cause false
      positives with it.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7a4b188f
  19. 24 4月, 2016 1 次提交
    • J
      jbd2: add support for avoiding data writes during transaction commits · 41617e1a
      Jan Kara 提交于
      Currently when filesystem needs to make sure data is on permanent
      storage before committing a transaction it adds inode to transaction's
      inode list. During transaction commit, jbd2 writes back all dirty
      buffers that have allocated underlying blocks and waits for the IO to
      finish. However when doing writeback for delayed allocated data, we
      allocate blocks and immediately submit the data. Thus asking jbd2 to
      write dirty pages just unnecessarily adds more work to jbd2 possibly
      writing back other redirtied blocks.
      
      Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
      outstanding data writes before committing a transaction and thus avoid
      unnecessary writes.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      41617e1a
  20. 18 4月, 2016 1 次提交
  21. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  22. 14 3月, 2016 1 次提交
    • M
      jbd2: do not fail journal because of frozen_buffer allocation failure · 490c1b44
      Michal Hocko 提交于
      Journal transaction might fail prematurely because the frozen_buffer
      is allocated by GFP_NOFS request:
      [   72.440013] do_get_write_access: OOM for frozen_buffer
      [   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
      (...snipped....)
      [   72.495559] do_get_write_access: OOM for frozen_buffer
      [   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.496839] do_get_write_access: OOM for frozen_buffer
      [   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
      [   72.505766] Aborting journal on device sda1-8.
      [   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
      
      This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
      allocations upon OOM" because small GPF_NOFS allocations never failed.
      This allocation seems essential for the journal and GFP_NOFS is too
      restrictive to the memory allocator so let's use __GFP_NOFAIL here to
      emulate the previous behavior.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      490c1b44
  23. 07 1月, 2016 1 次提交
  24. 05 12月, 2015 1 次提交
    • J
      jbd2: fix null committed data return in undo_access · 087ffd4e
      Junxiao Bi 提交于
      introduced jbd2_write_access_granted() to improve write|undo_access
      speed, but missed to check the status of b_committed_data which caused
      a kernel panic on ocfs2.
      
      [ 6538.405938] ------------[ cut here ]------------
      [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
      [ 6538.406686] invalid opcode: 0000 [#1] SMP
      [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
      [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
      [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
      [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
      [ 6538.406686] RIP: 0010:[<ffffffffa06a286b>]  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686] RSP: 0018:ffff880075b7b7f8  EFLAGS: 00010246
      [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
      [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
      [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
      [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
      [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
      [ 6538.406686] FS:  00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
      [ 6538.406686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
      [ 6538.406686] Stack:
      [ 6538.406686]  ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
      [ 6538.406686]  ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
      [ 6538.406686]  ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
      [ 6538.406686] Call Trace:
      [ 6538.406686]  [<ffffffffa06a309d>] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3654>] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a397e>] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a1750>] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0682d50>] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
      [ 6538.406686]  [<ffffffffa06a3ad5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa065072c>] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
      [ 6538.406686]  [<ffffffffa06843ac>] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654600>] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
      [ 6538.406686]  [<ffffffffa0654394>] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
      [ 6538.406686]  [<ffffffffa065acd4>] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
      [ 6538.406686]  [<ffffffffa017486b>] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
      [ 6538.406686]  [<ffffffffa065d5b5>] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
      [ 6538.406686]  [<ffffffffa0683430>] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
      [ 6538.406686]  [<ffffffffa067adb7>] ocfs2_truncate_file+0x297/0x380 [ocfs2]
      [ 6538.406686]  [<ffffffffa01759e4>] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
      [ 6538.406686]  [<ffffffffa067c7a2>] ocfs2_setattr+0x572/0x860 [ocfs2]
      [ 6538.406686]  [<ffffffff810e4a3f>] ? current_fs_time+0x3f/0x50
      [ 6538.406686]  [<ffffffff812124b7>] notify_change+0x1d7/0x340
      [ 6538.406686]  [<ffffffff8121abf9>] ? generic_getxattr+0x79/0x80
      [ 6538.406686]  [<ffffffff811f5876>] do_truncate+0x66/0x90
      [ 6538.406686]  [<ffffffff81120e30>] ? __audit_syscall_entry+0xb0/0x110
      [ 6538.406686]  [<ffffffff811f5bb3>] do_sys_ftruncate.clone.0+0xf3/0x120
      [ 6538.406686]  [<ffffffff811f5bee>] SyS_ftruncate+0xe/0x10
      [ 6538.406686]  [<ffffffff816aa2ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
      [ 6538.406686] RIP  [<ffffffffa06a286b>] ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
      [ 6538.406686]  RSP <ffff880075b7b7f8>
      [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
      [ 6538.694492] Kernel panic - not syncing: Fatal exception
      [ 6538.695484] Kernel Offset: disabled
      
      Fixes: de92c8ca("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      087ffd4e
  25. 25 11月, 2015 1 次提交
    • J
      jbd2: Fix unreclaimed pages after truncate in data=journal mode · bc23f0c8
      Jan Kara 提交于
      Ted and Namjae have reported that truncated pages don't get timely
      reclaimed after being truncated in data=journal mode. The following test
      triggers the issue easily:
      
      for (i = 0; i < 1000; i++) {
      	pwrite(fd, buf, 1024*1024, 0);
      	fsync(fd);
      	fsync(fd);
      	ftruncate(fd, 0);
      }
      
      The reason is that journal_unmap_buffer() finds that truncated buffers
      are not journalled (jh->b_transaction == NULL), they are part of
      checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
      been already written out (!buffer_dirty(bh)). We clean such buffers but
      we leave them in the checkpoint list. Since checkpoint transaction holds
      a reference to the journal head, these buffers cannot be released until
      the checkpoint transaction is cleaned up. And at that point we don't
      call release_buffer_page() anymore so pages detached from mapping are
      lingering in the system waiting for reclaim to find them and free them.
      
      Fix the problem by removing buffers from transaction checkpoint lists
      when journal_unmap_buffer() finds out they don't have to be there
      anymore.
      Reported-and-tested-by: NNamjae Jeon <namjae.jeon@samsung.com>
      Fixes: de1b7941Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      bc23f0c8
  26. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  27. 04 8月, 2015 1 次提交
    • L
      jbd2: limit number of reserved credits · 6d3ec14d
      Lukas Czerner 提交于
      Currently there is no limitation on number of reserved credits we can
      ask for. If we ask for more reserved credits than 1/2 of maximum
      transaction size, or if total number of credits exceeds the maximum
      transaction size per operation (which is currently only possible with
      the former) we will spin forever in start_this_handle().
      
      Fix this by adding this limitation at the start of start_this_handle().
      
      This patch also removes the credit limitation 1/2 of maximum transaction
      size, since we really only want to limit the number of reserved credits.
      There is not much point to limit the credits if there is still space in
      the journal.
      
      This accidentally also fixes the online resize, where due to the
      limitation of the journal credits we're unable to grow file systems with
      1k block size and size between 16M and 32M. It has been partially fixed
      by 2c869b26, but not entirely.
      
      Thanks Jan Kara for helping me getting the correct fix.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6d3ec14d
  28. 13 7月, 2015 1 次提交
    • J
      jbd2: speedup jbd2_journal_dirty_metadata() · 6e06ae88
      Jan Kara 提交于
      It is often the case that we mark buffer as having dirty metadata when
      the buffer is already in that state (frequent for bitmaps, inode table
      blocks, superblock). Thus it is unnecessary to contend on grabbing
      journal head reference and bh_state lock. Avoid that by checking whether
      any modification to the buffer is needed before grabbing any locks or
      references.
      
      [ Note: this is a fixed version of commit 2143c196, which was
        reverted in ebeaa8dd due to a false positive triggering of an
        assertion check. -- Ted ]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6e06ae88
  29. 28 6月, 2015 1 次提交
    • L
      Revert "jbd2: speedup jbd2_journal_dirty_metadata()" · ebeaa8dd
      Linus Torvalds 提交于
      This reverts commit 2143c196.
      
      This commit seems to be the cause of the following jbd2 assertion
      failure:
      
         ------------[ cut here ]------------
         kernel BUG at fs/jbd2/transaction.c:1325!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
         CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679 #1
         Hardware name:                  /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
         task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
         RIP: jbd2_journal_dirty_metadata+0x237/0x290
         Call Trace:
           __ext4_handle_dirty_metadata+0x43/0x1f0
           ext4_handle_dirty_dirent_node+0xde/0x160
           ? jbd2_journal_get_write_access+0x36/0x50
           ext4_delete_entry+0x112/0x160
           ? __ext4_journal_start_sb+0x52/0xb0
           ext4_unlink+0xfa/0x260
           vfs_unlink+0xec/0x190
           do_unlinkat+0x24a/0x270
           SyS_unlink+0x11/0x20
           entry_SYSCALL_64_fastpath+0x12/0x6a
         ---[ end trace ae033ebde8d080b4 ]---
      
      which is not easily reproducible (I've seen it just once, and then Ted
      was able to reproduce it once).  Revert it while Ted and Jan try to
      figure out what is wrong.
      
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebeaa8dd
  30. 21 6月, 2015 1 次提交
    • J
      jbd2: speedup jbd2_journal_dirty_metadata() · 2143c196
      Jan Kara 提交于
      It is often the case that we mark buffer as having dirty metadata when
      the buffer is already in that state (frequent for bitmaps, inode table
      blocks, superblock). Thus it is unnecessary to contend on grabbing
      journal head reference and bh_state lock. Avoid that by checking whether
      any modification to the buffer is needed before grabbing any locks or
      references.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      2143c196
  31. 09 6月, 2015 1 次提交
    • J
      jbd2: speedup jbd2_journal_get_[write|undo]_access() · de92c8ca
      Jan Kara 提交于
      jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
      frequently called for buffers that are already part of the running
      transaction - most frequently it is the case for bitmaps, inode table
      blocks, and superblock. Since in such cases we have nothing to do, it is
      unfortunate we still grab reference to journal head, lock the bh, lock
      bh_state only to find out there's nothing to do.
      
      Improving this is a bit subtle though since until we find out journal
      head is attached to the running transaction, it can disappear from under
      us because checkpointing / commit decided it's no longer needed. We deal
      with this by protecting journal_head slab with RCU. We still have to be
      careful about journal head being freed & reallocated within slab and
      about exposing journal head in consistent state (in particular
      b_modified and b_frozen_data must be in correct state before we allow
      user to touch the buffer).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      de92c8ca