1. 28 May, 2013 2 commits
    • ext4: make punch hole code path work with bigalloc · d23142c6
      Lukas Czerner committed
      Currently, punch hole is disabled on file systems with the bigalloc
      feature enabled. However, the recent changes in the punch hole patch
      should make it easier to support punching holes on bigalloc-enabled
      file systems.
      
      This commit changes partial_cluster handling in ext4_remove_blocks(),
      ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently
      partial_cluster is of type unsigned long long, and it makes sure that we
      free the partial cluster once all extents have been released from that
      cluster. However, it was designed specifically for truncate.
      
      With punch hole we can be freeing just some extents in the cluster,
      leaving the rest untouched. So we have to make sure that we notice a
      cluster which still has some extents. To do this I've changed
      partial_cluster to signed long long. The only scenario where
      this could be a problem is when cluster size == block size; however, in
      that case there are no partial clusters, so we're safe. For
      bigger clusters the signed type is enough. Now we use a negative value
      in partial_cluster to mark such a cluster as used, hence we know that we
      must not free it even if all other extents have been freed from it.
      
      This scenario can be described in a simple diagram:
      
      |FFF...FF..FF.UUU|
       ^----------^
        punch hole
      
      . - free space
      | - cluster boundary
      F - freed extent
      U - used extent
      
      Also update respective tracepoints to use signed long long type for
      partial_cluster.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
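The signed-sentinel idea from the punch hole commit can be illustrated with a minimal sketch. This is not the kernel code: the type and helper names (`partial_cluster_t`, `mark_cluster_pending`, `mark_cluster_used`, `may_free_cluster`) are hypothetical, and the sketch assumes cluster numbers start at 1 so that 0 can mean "nothing tracked".

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the sentinel encoding:
 *   0  : no partial cluster is being tracked
 *   >0 : cluster that may be freed once its extents are gone
 *   <0 : cluster that still holds a used extent and must not be freed
 */
typedef long long partial_cluster_t;

/* Track cluster c as a candidate for freeing. */
static partial_cluster_t mark_cluster_pending(long long c)
{
    assert(c > 0); /* assumption of this sketch: cluster numbers start at 1 */
    return c;
}

/* Track cluster c as still in use by storing its negated number. */
static partial_cluster_t mark_cluster_used(long long c)
{
    assert(c > 0);
    return -c;
}

/* Only a positive value allows the cluster to be freed. */
static bool may_free_cluster(partial_cluster_t pc)
{
    return pc > 0;
}
```

In the diagram above, the cluster holding the `UUU` extent would be recorded with `mark_cluster_used()`, so freeing every `F` extent still leaves it allocated.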
    • ext4: update ext4_ext_remove_space trace point · 61801325
      Lukas Czerner committed
      Add "end" variable.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  2. 22 May, 2013 1 commit
  3. 03 May, 2013 1 commit
    • ext4: fix fio regression · e30b5dca
      Yan, Zheng committed
      We (Linux Kernel Performance project) found a regression introduced
      by commit:
      
        f7fec032 ext4: track all extent status in extent status tree
      
      The commit causes about a 20% performance decrease in the fio random
      write test. The profiler shows that rb_next() uses a lot of CPU time.
      The call stack is:
      
        rb_next
        ext4_es_find_delayed_extent
        ext4_map_blocks
        _ext4_get_block
        ext4_get_block_write
        __blockdev_direct_IO
        ext4_direct_IO
        generic_file_direct_write
        __generic_file_aio_write
        ext4_file_write
        aio_rw_vect_retry
        aio_run_iocb
        do_io_submit
        sys_io_submit
        system_call_fastpath
        io_submit
        td_io_getevents
        io_u_queued_complete
        thread_main
        main
        __libc_start_main
      
      The cause is that ext4_es_find_delayed_extent() doesn't have an
      upper bound; it keeps searching until a delayed extent is found.
      When there are a lot of non-delayed entries in the extent status
      tree, ext4_es_find_delayed_extent() may use a lot of CPU time.
      Reported-by: LKP project <lkp@linux.intel.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
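The cost of an unbounded search, and one way to bound it, can be sketched as follows. This is an illustration only: it uses a sorted array instead of the kernel's rb-tree, and `struct es_entry` and `find_delayed_in_range` are hypothetical names, not the ext4 API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached extent: blocks [lblk, lblk + len - 1], len >= 1. */
struct es_entry {
    unsigned int lblk;
    unsigned int len;
    bool delayed;
};

/*
 * Return the first delayed entry overlapping [start, end] in an array
 * sorted by lblk, or NULL if there is none. The scan stops as soon as
 * an entry begins past `end`, so a long run of non-delayed entries
 * beyond the range costs nothing.
 */
static const struct es_entry *
find_delayed_in_range(const struct es_entry *es, size_t n,
                      unsigned int start, unsigned int end)
{
    assert(start <= end);
    for (size_t i = 0; i < n; i++) {
        if (es[i].lblk + es[i].len - 1 < start)
            continue; /* entry lies entirely before the range */
        if (es[i].lblk > end)
            break;    /* past the upper bound: stop searching */
        if (es[i].delayed)
            return &es[i];
    }
    return NULL;
}
```

Without the `break`, a lookup for a small range would walk every following non-delayed entry until it met a delayed one, which is the behaviour the commit message blames for the rb_next() time.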
  4. 10 April, 2013 1 commit
  5. 04 April, 2013 1 commit
  6. 01 March, 2013 1 commit
    • ext4: optimize ext4_es_shrink() · 24630774
      Theodore Ts'o committed
      When the system is under memory pressure, ext4_es_shrink() will get
      called very often.  So optimize returning the number of items in the
      file system's extent status cache by keeping a per-filesystem count,
      instead of calculating it each time by scanning all of the inodes in
      the extent status cache.
      
      Also rename the slab used for the extent status cache to be
      "ext4_extent_status" so it's obvious the slab in question is created
      by ext4.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
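The trade-off described in the ext4_es_shrink() commit, a running per-filesystem count versus a scan over all inodes, can be sketched like this. The structures and function names are hypothetical stand-ins, not ext4's own, and locking is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-inode cache state. */
struct inode_cache {
    unsigned long nr_extents;
};

/* Hypothetical per-filesystem state with a running total. */
struct fs_cache {
    struct inode_cache *inodes;
    size_t nr_inodes;
    unsigned long total_extents; /* updated on every add and delete */
};

static void cache_add(struct fs_cache *fs, size_t ino)
{
    assert(ino < fs->nr_inodes);
    fs->inodes[ino].nr_extents++;
    fs->total_extents++;
}

static void cache_del(struct fs_cache *fs, size_t ino)
{
    assert(ino < fs->nr_inodes && fs->inodes[ino].nr_extents > 0);
    fs->inodes[ino].nr_extents--;
    fs->total_extents--;
}

/* O(number of inodes): what a shrinker would pay on every call. */
static unsigned long count_slow(const struct fs_cache *fs)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < fs->nr_inodes; i++)
        sum += fs->inodes[i].nr_extents;
    return sum;
}

/* O(1): read the running total instead. */
static unsigned long count_fast(const struct fs_cache *fs)
{
    return fs->total_extents;
}
```

The running total makes each add and delete slightly more expensive in exchange for a constant-time count, which pays off when the count is requested often under memory pressure.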
  7. 18 February, 2013 5 commits
    • ext4: reclaim extents from extent status tree · 74cd15cd
      Zheng Liu committed
      Although extent status is loaded on demand, we also need to reclaim
      extents from the tree when we are under heavy memory pressure, because
      in some cases a fragmented extent tree causes the status tree to
      consume too much memory.
      
      Here we maintain an LRU list in super_block.  When the extent status of
      an inode is accessed and changed, the inode is moved to the tail
      of the list.  The inode is dropped from the list when it is
      cleared.  In the inode, a counter is added to count the number of
      cached objects in the extent status tree.  Only written/unwritten/hole
      extents are counted, because delayed extents cannot be reclaimed:
      fiemap, bigalloc and seek_data/hole need them.  The counter is
      incremented when a new extent is allocated and decremented when an
      extent is freed.
      
      In this commit we use the normal shrinker framework to reclaim memory
      from the status tree.  ext4_es_reclaim_extents_count() traverses the LRU
      list to count the number of reclaimable extents.  ext4_es_shrink() tries
      to reclaim written/unwritten/hole extents from the extent status tree.
      An inode that has been shrunk is moved to the tail of the LRU list.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
    • ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu committed
      After tracking all extent status, we already have an extent cache in
      memory.  Every time we want to look up a block mapping, we can first
      try to look it up in the extent status tree to avoid a potential disk
      I/O.
      
      A new function called ext4_es_lookup_extent is defined to do this
      work.  When we try to look up a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to look up the block mapping in the extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into the extent status tree, because this
      hole will be converted to a delayed extent in the tree immediately.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
    • ext4: rename and improve ext4_es_find_extent() · be401363
      Zheng Liu committed
      This commit renames ext4_es_find_extent to ext4_es_find_delayed_extent
      and improves this function.  First, we split the input and output
      parameters.  Second, this function never returns the first block of the
      next delayed extent after 'es'.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
    • ext4: add physical block and status member into extent status tree · fdc0212e
      Zheng Liu committed
      This commit adds two members to the extent_status structure to let it
      record the physical block and the extent status.  Here es_pblk is used
      to record both of them, because a physical block number only has 48
      bits, so the extent status can be stashed into the same field to save
      some memory.  Written, unwritten, delayed and hole are now defined as
      statuses.
      
      Because a new member is added to the extent status tree, all interfaces
      need to be adjusted.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
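Packing a status into the unused upper bits of a 64-bit field that carries a 48-bit block number can be sketched as below. The exact bit layout is an assumption of this sketch (status in the top four bits), and the `SK_`/`sk_` names are hypothetical rather than the ext4 definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 48 bits hold the block number, top 4 bits the status. */
#define SK_PBLK_BITS    48
#define SK_PBLK_MASK    ((UINT64_C(1) << SK_PBLK_BITS) - 1)
#define SK_STATUS_SHIFT 60

/* Hypothetical status flags matching the four states named in the commit. */
enum sk_status {
    SK_WRITTEN   = 1,
    SK_UNWRITTEN = 2,
    SK_DELAYED   = 4,
    SK_HOLE      = 8,
};

/* Combine a physical block number and a status into one 64-bit value. */
static uint64_t sk_pack(uint64_t pblk, unsigned int status)
{
    assert((pblk & ~SK_PBLK_MASK) == 0); /* block number must fit in 48 bits */
    assert(status <= 0xF);               /* status must fit in 4 bits */
    return ((uint64_t)status << SK_STATUS_SHIFT) | pblk;
}

/* Extract the physical block number. */
static uint64_t sk_get_pblock(uint64_t v)
{
    return v & SK_PBLK_MASK;
}

/* Extract the status bits. */
static unsigned int sk_get_status(uint64_t v)
{
    return (unsigned int)(v >> SK_STATUS_SHIFT);
}
```

Sharing one field keeps each cached entry the size of a single 64-bit value for both pieces of information, at the cost of a mask or shift on every access.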
    • ext4: refine extent status tree · 06b0c886
      Zheng Liu committed
      This commit refines the extent status tree code.
      
      1) A prefix 'es_' is added to the extent status tree structure
      members.
      
      2) Refactor es_remove_extent() so that __es_remove_extent() can be
      used by es_insert_extent() to remove the old extent entry(-ies) before
      inserting a new one.
      
      3) Rename extent_status_end() to ext4_es_end().
      
      4) ext4_es_can_be_merged() is defined to check whether two extents can
      be merged or not.
      
      5) Update and clarify comments.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
  8. 17 January, 2013 1 commit
  9. 26 December, 2012 1 commit
  10. 09 November, 2012 3 commits
  11. 17 August, 2012 2 commits
  12. 16 May, 2012 1 commit
  13. 19 December, 2011 1 commit
  14. 27 October, 2011 1 commit
    • ext4: optimize ext4_ext_convert_to_initialized() · 6f91bc5f
      Eric Gouriou committed
      This patch introduces a fast path in ext4_ext_convert_to_initialized()
      for the case when the conversion can be performed by transferring
      the newly initialized blocks from the uninitialized extent into
      an adjacent initialized extent. Doing so removes the expensive
      invocations of memmove() which occur during extent insertion and
      the subsequent merge.
      
      In practice this should be the common case for clients performing
      append writes into files pre-allocated via
      fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
      direct IO and when using a suboptimal implementation of memmove()
      (x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
      consumption by 32%.
      
      Two new trace points are added to ext4_ext_convert_to_initialized()
      to offer visibility into its operations. No exit trace point has
      been added due to the multiplicity of return points. This can be
      revisited once the upstream cleanup is backported.
      Signed-off-by: Eric Gouriou <egouriou@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
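The fast path described in this commit, moving newly initialized blocks into an adjacent initialized extent instead of inserting a new extent, can be sketched in simplified form. `struct sk_extent` and `transfer_to_left` are hypothetical; the sketch handles only a left-hand neighbour and only a partial transfer, and the real function works on on-disk extent structures with more conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-memory extent: len blocks starting at lblk / pblk. */
struct sk_extent {
    unsigned int lblk;       /* first logical block */
    unsigned int len;        /* length in blocks */
    unsigned long long pblk; /* first physical block */
    bool initialized;
};

/*
 * Move the first n blocks of uninitialized extent u into the initialized
 * extent prev that sits directly before it, logically and physically.
 * Only two lengths and two start positions change, so no entries in the
 * extent array have to be shifted. Returns false if the fast path does
 * not apply.
 */
static bool transfer_to_left(struct sk_extent *prev, struct sk_extent *u,
                             unsigned int n)
{
    if (!prev->initialized || u->initialized)
        return false;
    if (n == 0 || n >= u->len)
        return false; /* this sketch keeps u non-empty */
    if (prev->lblk + prev->len != u->lblk)
        return false; /* not logically adjacent */
    if (prev->pblk + prev->len != u->pblk)
        return false; /* not physically adjacent */

    prev->len += n;
    u->lblk += n;
    u->pblk += n;
    u->len -= n;
    return true;
}
```

For an append write into a preallocated file, `prev` is the already written part and `u` the remaining preallocation, so each write only moves the boundary between the two.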
  15. 10 September, 2011 1 commit
  16. 31 July, 2011 1 commit
  17. 11 July, 2011 2 commits
  18. 08 June, 2011 1 commit
  19. 06 June, 2011 1 commit
  20. 22 March, 2011 1 commit
  21. 09 November, 2010 1 commit
  22. 28 October, 2010 4 commits
    • ext4,jbd2: convert tracepoints to use major/minor numbers · a269029d
      Theodore Ts'o committed
      Unfortunately perf can't deal with anything other than direct structure
      accesses in the TP_printk() section.  It will drop dead when it sees
      jbd2_dev_to_name() in the "print fmt" section of the tracepoint.
      
      Addresses-Google-Bug: 3138508
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
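Splitting a device number into major and minor parts needs only plain integer expressions, which is what makes it suitable for a print format that cannot call helper functions. The sketch below mirrors the 12-bit major / 20-bit minor split of the in-kernel device number in userspace; the `SK_`/`sk_` names are hypothetical and this is not the tracepoint code itself.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MINORBITS 20
#define SK_MINORMASK ((UINT32_C(1) << SK_MINORBITS) - 1)

/* Build a 32-bit device number from major and minor parts. */
static uint32_t sk_mkdev(uint32_t major, uint32_t minor)
{
    assert(minor <= SK_MINORMASK);
    assert(major < (UINT32_C(1) << (32 - SK_MINORBITS)));
    return (major << SK_MINORBITS) | minor;
}

/* Upper 12 bits: major number. */
static uint32_t sk_major(uint32_t dev)
{
    return dev >> SK_MINORBITS;
}

/* Lower 20 bits: minor number. */
static uint32_t sk_minor(uint32_t dev)
{
    return dev & SK_MINORMASK;
}
```

A device shown as `8:16` would round-trip through `sk_mkdev(8, 16)`, `sk_major()` and `sk_minor()` unchanged.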
    • ext4: don't use ext4_allocation_contexts for tracing · 3e1e5f50
      Eric Sandeen committed
      Many tracepoints were populating an ext4_allocation_context
      to pass in, but this requires a slab allocation even when
      tracepoints are off.  In fact, 4 of 5 of these allocations
      were only for tracing.  In addition, we were only using a
      small fraction of the 144 bytes of this structure for this
      purpose.
      
      We can do away with all these alloc/frees of the ac and
      simply pass in the bits we care about, instead.
      
      I tested this by turning on tracing and running through
      xfstests on x86_64.  I did not actually do anything with
      the trace output, however.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix oops in trace_ext4_mb_release_group_pa · 4d547616
      Eric Sandeen committed
      Our QA reported an oops in the ext4_mb_release_group_pa tracing,
      and Josef Bacik pointed out that it was because we may have a
      non-null but uninitialized ac_inode in the allocation context.
      
      I can reproduce it when running xfstests with ext4 tracepoints on, 
      on a CONFIG_SLAB_DEBUG kernel.
      
      We call trace_ext4_mb_release_group_pa from 2 places:
      ext4_mb_discard_group_preallocations and
      ext4_mb_discard_lg_preallocations.
      
      In both cases we allocate an ac as a container just for tracing (!)
      and never fill in the ac_inode.  There's no reason to be assigning,
      testing, or printing it as far as I can see, so just remove it from
      the tracepoint.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: avoid null dereference in trace_ext4_mballoc_discard · b853fd36
      Wen Congyang committed
      ac->inode is set to null in function ext4_mb_release_group_pa().
      When trace_ext4_mballoc_discard(ac) is then called, the kernel
      panics.
      
      BUG: unable to handle kernel NULL pointer dereference at 000000a4
      IP: [<f87e1714>] ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
      *pdpt = 0000000000abd001 *pde = 0000000000000000
      Oops: 0000 [#1] SMP
      
      Pid: 550, comm: flush-8:16 Not tainted 2.6.36-rc1 #1 SE7320EP2/Altos G530
      EIP: 0060:[<f87e1714>] EFLAGS: 00010206 CPU: 1
      EIP is at ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
      EAX: f32ac840 EBX: f3f1cf88 ECX: f32ac840 EDX: 00000000
      ESI: f32ac83c EDI: f880b9d8 EBP: 00000000 ESP: f4b77ae4
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      Process flush-8:16 (pid: 550, ti=f4b76000 task=f613e540 task.ti=f4b76000)
      Call Trace:
       [<f87f5ac1>] ? ext4_mb_release_group_pa+0x121/0x150 [ext4]
       [<f87f8356>] ? ext4_mb_discard_group_preallocations+0x336/0x400 [ext4]
       [<f87fb7f1>] ? ext4_mb_new_blocks+0x3d1/0x4f0 [ext4]
       [<c05a6c5b>] ? __make_request+0x10b/0x440
       [<f87f1fb4>] ? ext4_ext_map_blocks+0x1334/0x1980 [ext4]
       [<c04ac78a>] ? rb_reserve_next_event+0xaa/0x3b0
       [<f87d18d6>] ? ext4_map_blocks+0xd6/0x1d0 [ext4]
       [<f87d2da7>] ? mpage_da_map_blocks+0xc7/0x8a0 [ext4]
       [<c04c8a68>] ? find_get_pages_tag+0x38/0x110
       [<c04d23a5>] ? __pagevec_release+0x15/0x20
       [<f87d3ca5>] ? ext4_da_writepages+0x2b5/0x5d0 [ext4]
       [<c04cfbe0>] ? __writepage+0x0/0x30
       [<c04d0e34>] ? do_writepages+0x14/0x30
       [<c0526600>] ? writeback_single_inode+0xa0/0x240
       [<c0526971>] ? writeback_sb_inodes+0xc1/0x180
       [<c0526ab8>] ? writeback_inodes_wb+0x88/0x140
       [<c0526d7b>] ? wb_writeback+0x20b/0x320
       [<c045aca7>] ? lock_timer_base+0x27/0x50
       [<c0526fe0>] ? wb_do_writeback+0x150/0x190
       [<c05270a8>] ? bdi_writeback_thread+0x88/0x1f0
       [<c043b680>] ? complete+0x40/0x60
       [<c0527020>] ? bdi_writeback_thread+0x0/0x1f0
       [<c0469474>] ? kthread+0x74/0x80
       [<c0469400>] ? kthread+0x0/0x80
       [<c040a23e>] ? kernel_thread_helper+0x6/0x10
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  23. 27 October, 2010 1 commit
    • writeback: remove nonblocking/encountered_congestion references · 1b430bee
      Wu Fengguang committed
      This removes more dead code that was somehow missed by commit 0d99519e
      (writeback: remove unused nonblocking and congestion checks).  There is
      no behavior change except for the removal of two entries from one of the
      ext4 tracing interfaces.
      
      The nonblocking checks in ->writepages are no longer used because the
      flusher now prefers to block on get_request_wait() rather than skip
      inodes on IO congestion.  The latter would lead to more seeky IO.
      
      The nonblocking checks in ->writepage are no longer used because they
      are redundant with the WB_SYNC_NONE check.
      
      We no longer set ->nonblocking in VM page out and page migration, because
      a) it's effectively redundant with WB_SYNC_NONE in current code
      b) its old semantic of "Don't get stuck on request queues" is misbehavior:
         that would skip some dirty inodes on congestion and page out others, which
         is unfair in terms of LRU age.
      
      Inspired by Christoph Hellwig. Thanks!
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Sage Weil <sage@newdream.net>
      Cc: Steve French <sfrench@samba.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 27 July, 2010 1 commit
  25. 09 June, 2010 1 commit
    • writeback: pay attention to wbc->nr_to_write in write_cache_pages · 0b564927
      Dave Chinner committed
      If a filesystem writes more than one page in ->writepage, write_cache_pages
      fails to notice this and continues to attempt writeback when wbc->nr_to_write
      has gone negative - this trace was captured from XFS:
      
          wbc_writeback_start: towrt=1024
          wbc_writepage: towrt=1024
          wbc_writepage: towrt=0
          wbc_writepage: towrt=-1
          wbc_writepage: towrt=-5
          wbc_writepage: towrt=-21
          wbc_writepage: towrt=-85
      
      This has adverse effects on filesystem writeback behaviour. write_cache_pages()
      needs to terminate after a certain number of pages are written, not after a
      certain number of calls to ->writepage are made.  This is a regression
      introduced by 17bc6c30 ("vfs: Add
      no_nrwrite_index_update writeback control flag"), but cannot be reverted
      directly due to subsequent bug fixes that have gone in on top of it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
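The difference between budgeting by pages written and budgeting by calls made can be shown with a small model. `struct sk_wbc`, `writepage_clustered` and `write_pages` are hypothetical stand-ins, not the kernel's writeback_control or write_cache_pages(), and the model assumes every call writes a fixed batch of pages.

```c
#include <assert.h>

/* Hypothetical writeback budget, counted in pages. */
struct sk_wbc {
    long nr_to_write;
};

/* Stand-in for a ->writepage that writes `batch` pages per call. */
static void writepage_clustered(struct sk_wbc *wbc, long batch)
{
    wbc->nr_to_write -= batch;
}

/*
 * Write back up to dirty_pages pages and return the number of
 * ->writepage calls made. The loop checks the page budget after
 * every call, so it stops once the budget is used up even when a
 * single call consumes several pages.
 */
static long write_pages(struct sk_wbc *wbc, long dirty_pages, long batch)
{
    long calls = 0;

    assert(batch > 0);
    while (dirty_pages > 0) {
        writepage_clustered(wbc, batch);
        dirty_pages -= batch;
        calls++;
        if (wbc->nr_to_write <= 0)
            break; /* the budget is in pages, not in calls */
    }
    return calls;
}
```

With a budget of 1024 pages and 4 pages written per call, the loop stops after 256 calls instead of running on while the budget goes negative, as it does in the trace above.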
  26. 28 May, 2010 1 commit
  27. 17 May, 2010 2 commits