- 08 Aug 2009, 2 commits
-
-
Authored by Christoph Hellwig
When we want to tear down an inode that lost the add-to-the-cache race in XFS, we must not call into ->destroy_inode, because that would delete the inode that won the race from the inode cache radix tree. This patch provides the __destroy_inode helper needed to fix this; the actual fix will be in the next patch. As XFS was the only reason destroy_inode was exported, we shift the export to the new __destroy_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
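A minimal sketch of what such a split can look like in fs/inode.c — the helper names follow the commit description, but the exact bodies here are an assumption, not the verbatim patch:

	/* Tear down the generic VFS state of an inode without freeing it
	 * and without calling back into the filesystem. */
	void __destroy_inode(struct inode *inode)
	{
		BUG_ON(inode_has_buffers(inode));
		security_inode_free(inode);
	}
	EXPORT_SYMBOL(__destroy_inode);	/* XFS now uses this export instead */

	void destroy_inode(struct inode *inode)
	{
		__destroy_inode(inode);
		if (inode->i_sb->s_op->destroy_inode)
			inode->i_sb->s_op->destroy_inode(inode);
		else
			kmem_cache_free(inode_cachep, inode);
	}

A filesystem that lost the insertion race can then call __destroy_inode() plus its own free routine, without ->destroy_inode removing the winning inode from its radix tree.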
-
Authored by Christoph Hellwig
Currently inode_init_always calls into ->destroy_inode if the additional initialization fails. That's not only counter-intuitive, because inode_init_always did not allocate the inode structure, but in the case of XFS it's actively harmful, as ->destroy_inode might delete the inode from a radix tree it has never been added to. This in turn might end up deleting the inode for the same inum that has been instantiated by another process and cause lots of subtle problems. Also, in the case of re-initializing a reclaimable inode in XFS, it would free an inode we still want to keep alive. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
-
- 04 Aug 2009, 3 commits
-
-
Authored by Steve French
Signed-off-by: Steve French <sfrench@us.ibm.com>
-
Authored by Roel Kluin
Check whether the index is within bounds before testing the element. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
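The bug class is the classic reversed short-circuit test; the names below are hypothetical and not from the patched CIFS code — only the reordering of the two conditions matters:

	/* Illustrative only. */
	static int entry_is_set(const int *table, unsigned int nr_entries,
				unsigned int idx)
	{
		/* Wrong: table[idx] is read before idx is known to be in range.
		 *     return table[idx] != 0 && idx < nr_entries;
		 * Right: the bounds check short-circuits the array access. */
		return idx < nr_entries && table[idx] != 0;
	}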
-
Authored by Jeff Layton
Since forceuid is the default, we now need to show when it's disabled. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
-
- 02 Aug 2009, 2 commits
-
-
Authored by Ryusuke Konishi
This adds a missing unlock of nilfs->ns_writer_mutex in the nilfs_mdt_write_page() function. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
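The shape of the fix, sketched as a small illustrative function — the helper and error code are assumptions, not the literal nilfs_mdt_write_page() body:

	/* Illustrative only: every exit taken after mutex_lock() must unlock. */
	static int mdt_write_locked(struct the_nilfs *nilfs, struct page *page)
	{
		int err;

		mutex_lock(&nilfs->ns_writer_mutex);
		if (!nilfs->ns_writer) {
			/* this early exit previously returned without unlocking */
			mutex_unlock(&nilfs->ns_writer_mutex);
			return -EROFS;
		}
		err = write_the_page(nilfs, page);	/* hypothetical helper */
		mutex_unlock(&nilfs->ns_writer_mutex);
		return err;
	}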
-
Authored by Jeff Layton
This patch fixes the regression reported here: http://bugzilla.kernel.org/show_bug.cgi?id=13861 Commit 4ae1507f changed the default behavior when the uid= or gid= option was specified for a mount. The existing behavior was to always clobber the ownership information provided by the server when these options were specified. The above commit changed this behavior so that these options simply provided defaults when the server did not provide this information (unless "forceuid" or "forcegid" were specified). This patch reverts that change so that the default behavior is restored. It also adds "noforceuid" and "noforcegid" options so that ownership information from the server is preserved, even when the mount has uid= or gid= options specified. It also adds a couple of printk notices that pop up when forceuid or forcegid options are specified without a uid= or gid= option. Reported-by: Tom Chiverton <bugzilla.kernel.org@falkensweb.com> Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
-
- 01 Aug 2009, 1 commit
-
-
Authored by Ryusuke Konishi
Andrea Gelmini reported that a kernel oops was hit on a nilfs filesystem with a 1KB block size when running rsync. This turned out to be caused by an inconsistency of dirty state between a page and its buffers storing b-tree node blocks. If the page had multiple buffers split over multiple logs, and those logs were written at the same time, a dirty flag remained on the page even though every dirty flag in the buffers had been cleared. This fixes the failure by properly dropping the dirty flag for pages holding multiple discrete b-tree nodes. Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com> Cc: stable@kernel.org
-
- 31 Jul 2009, 2 commits
-
-
Authored by Eric Sandeen
The VM calculation for nr_to_write seems off. Bump it way up; this gets simple streaming writes zippy again. To be reviewed again after Jens' writeback changes. Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Cc: Chris Mason <chris.mason@oracle.com> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
-
Authored by Eric Sandeen
commit 6321e3ed caused the full bmv_count's worth of getbmapx structures to get allocated; telling it to do MAXEXTNUM was a bit insane, resulting in ENOMEM every time. Chop it down to something reasonable: the number of slots in the caller's input buffer. If this is too large the caller may get ENOMEM, but the reason should not be a mystery, and they can try again with something smaller. We add 1 to the value because in the normal getbmap world, bmv_count includes the header, and xfs_getbmap does: nex = bmv->bmv_count - 1; if (nex <= 0) return XFS_ERROR(EINVAL); Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Olaf Weber <olaf@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>
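Roughly, the request size changes from the fixed worst case to what the caller's buffer can actually hold — a sketch based on the description above, not the verbatim diff (fi_extents_max is the slot count from the generic fiemap_extent_info):

	/* Before: always ask for the theoretical maximum, so xfs_getbmap()
	 * allocated a huge getbmapx array and returned -ENOMEM.
	 *	bm.bmv_count = MAXEXTNUM;
	 * After: size the request to the caller's buffer, +1 because
	 * bmv_count also counts the header slot. */
	bm.bmv_count = fieinfo->fi_extents_max + 1;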
-
- 30 Jul 2009, 14 commits
-
-
Authored by Jan Kara
Commit d01730d7 didn't completely fix the problem, since we still take dqio_mutex and i_mutex in the wrong order. Move the taking of i_mutex further down (luckily it's needed only for updating inode flags), below where dqio_mutex is taken. Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Jan Kara <jack@suse.cz>
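In outline, the corrected ordering looks like this — a simplified sketch of the locking described above, with hypothetical helper names, not the literal quota code:

	mutex_lock(&dqopt->dqio_mutex);
	err = write_quota_structure(dquot);		/* hypothetical helper */

	/* i_mutex is only needed for the inode flag update, so it is now
	 * taken after dqio_mutex, matching the ordering used elsewhere. */
	mutex_lock(&quota_inode->i_mutex);
	update_quota_inode_flags(quota_inode);		/* hypothetical helper */
	mutex_unlock(&quota_inode->i_mutex);

	mutex_unlock(&dqopt->dqio_mutex);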
-
Authored by Jan Kara
The VAT inode is located in the last recorded block of the medium. When the drive erroneously reports the number of recorded blocks, we fail to load the VAT inode and thus cannot mount the medium. This patch makes the kernel try to read the VAT inode from the last block of the device if that differs from the last recorded block. Signed-off-by: Jan Kara <jack@suse.cz>
-
Authored by Chris Mason
The semaphore used by the async caching threads can prevent a transaction commit, which can make the FS appear to stall. This releases the semaphore more often when a transaction commit is in progress. Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Yan Zheng
The async block group caching code uses the commit_root pointer to get a stable version of the extent allocation tree for scanning. This copy of the tree root isn't going to change, and it significantly reduces the complexity of the scanning code. During a commit, we have a loop where we update the extent allocation tree root. We need to loop because updating the root pointer in the tree of tree roots may allocate blocks, which may change the extent allocation tree. Right now the commit_root pointer is changed inside this loop. It is more correct to change the commit_root pointer only after all the looping is done. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Benjamin Marzinski
When a file is deleted from a gfs2 filesystem on one node, a dcache entry for it may still exist on other nodes in the cluster. If this happens, gfs2 will be unable to free this file on disk. Because of this, it's possible to have a gfs2 filesystem with no files on it and no free space. With this patch, when a node receives a callback notifying it that the file is being deleted on another node, it schedules a new workqueue thread to remove the file's dcache entry. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
-
Authored by Benjamin Marzinski
Since both linked and unlinked inodes are counted by rgd->rd_dinodes, it makes no sense to count them with the used data blocks (the first check that I changed); it makes sense to count them with the linked inodes (the second check); and it makes no sense to care whether there are more unlinked inodes than linked ones. This fixes these errors. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
-
Authored by Benjamin Marzinski
GFS2 was placing far too many glocks on the reclaim list that were not good candidates for freeing up from cache. These locks would sit there and repeatedly get scanned to see if they could be reclaimed, wasting a lot of time when there was memory pressure. This fix does more checks on the locks to see if they are actually likely to be removable from cache. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
-
Authored by Steven Whitehouse
When searching for unlinked, but still allocated, inodes during block allocation, avoid the block relating to the inode that is doing the allocation. This fixes a hang caused when an unlinked, but still open, inode tries to allocate some more blocks and ends up finding itself during the search for deallocatable inodes. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
-
Authored by Benjamin Marzinski
It is possible for gfs2_shrink_glock_memory() to check a glock for demotion that's in the process of being freed by gfs2_glock_put(). In this case, gfs2_shrink_glock_memory() will acquire a new reference to this glock, and then try to free the glock itself when it drops the reference. To solve this, gfs2_shrink_glock_memory() just needs to check whether the glock is in the process of being freed, and if so skip it without ever unlocking the lru_lock. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
-
Authored by Benjamin Marzinski
GFS2 wasn't syncing its statfs info on grows. This causes a problem when you grow the filesystem on multiple nodes. GFS2 would calculate the new space based on the resource groups (which are always current), and then assume that the filesystem had grown from the existing statfs size. If you grew the filesystem on two different nodes in a short time, the second node wouldn't see the statfs size change from the first node, and would assume that it had grown by a larger amount than it actually had. When all these changes were synced out, the total filesystem size would be incorrect (the first grow would be counted twice). This patch makes GFS2 read in the statfs changes from disk before a grow, and write them out after the grow, while the master statfs inode is locked. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
-
Authored by Steven Whitehouse
This patch removes some of the special cases that the shrinker was trying to deal with. As a result we leave fewer items on the list, and none at all which cannot be demoted. This makes the list scanning more efficient and solves some issues seen with large numbers of inodes. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
-
Authored by Steve French
Signed-off-by: Steve French <sfrench@us.ibm.com>
-
Authored by Catalin Marinas
This file makes use of various macros defined in files like asm/current.h or asm-generic/resource.h. All these files can be included via sched.h. The building of the !MMU ARM kernel (with additional patches) fails without this change. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Authored by Alan Jenkins
Create bdgrab(). This function copies an existing reference to a block_device. It is safe to call from any context. Hibernation code wishes to copy a reference to the active swap device. Right now it calls bdget() under a spinlock, but this is wrong because bdget() can sleep. It doesn't need a full bdget() because we already hold a reference to active swap devices (and the spinlock protects against swapoff). Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827 Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
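A sketch of what such a helper can look like — it bumps the reference count of the inode backing the block_device, which never sleeps; the exact body in the patch may differ:

	/*
	 * Copy an existing block_device reference.  Unlike bdget() this
	 * cannot sleep, so it is safe under a spinlock; the caller must
	 * already hold a reference that keeps bdev alive.
	 */
	struct block_device *bdgrab(struct block_device *bdev)
	{
		atomic_inc(&bdev->bd_inode->i_count);
		return bdev;
	}

The hibernation path can then call bdgrab() on the active swap device while holding its spinlock, instead of bdget().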
-
- 29 Jul 2009, 3 commits
-
-
Authored by Ramon de Carvalho Valle
The parse_tag_3_packet function does not check whether the tag 3 packet contains an encrypted key size larger than ECRYPTFS_MAX_ENCRYPTED_KEY_BYTES. Signed-off-by: Ramon de Carvalho Valle <ramon@risesecurity.org> [tyhicks@linux.vnet.ibm.com: Added printk newline and changed goto to out_free] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: stable@kernel.org (2.6.27 and 30) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
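The class of check being added — validate the attacker-controlled size field against the destination buffer before copying — looks roughly like this (a hedged sketch, not necessarily the exact hunk):

	/* Reject packets that claim a key larger than the buffer we are
	 * about to copy it into. */
	if ((*new_auth_tok)->session_key.encrypted_key_size
	    > ECRYPTFS_MAX_ENCRYPTED_KEY_BYTES) {
		printk(KERN_WARNING "Tag 3 packet contains key larger "
		       "than ECRYPTFS_MAX_ENCRYPTED_KEY_BYTES\n");
		rc = -EINVAL;
		goto out_free;
	}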
-
Authored by Tyler Hicks
Tag 11 packets are stored in the metadata section of an eCryptfs file to store the key signature(s) used to encrypt the file encryption key. After extracting the packet length field to determine the key signature length, no check is performed to see whether the length would exceed the key signature buffer size that was passed into parse_tag_11_packet(). Thanks to Ramon de Carvalho Valle for finding this bug using fsfuzzer. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: stable@kernel.org (2.6.27 and 30) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Authored by Peter Oberparleiter
Update the directory hardlink count when moving kobjects to a new parent. Fixes the following problem, which occurs when several devices are moved to the same parent and then unregistered:

> ls -laF /sys/devices/css0/defunct/
> total 0
> drwxr-xr-x 4294967295 root root 0 2009-07-14 17:02 ./
> drwxr-xr-x 114 root root 0 2009-07-14 17:02 ../
> drwxr-xr-x 2 root root 0 2009-07-14 17:01 power/
> -rw-r--r-- 1 root root 4096 2009-07-14 17:01 uevent

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
-
- 28 Jul 2009, 4 commits
-
-
Authored by Yan Zheng
- don't stop the caching thread until btrfs_commit_super returns.
- if caching is interrupted by umount, set last to (u64)-1; otherwise the un-scanned range of the block group will be considered a free extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Jeff Layton
If the referral is malformed or the hostname can't be resolved, then the current code generates an oops. Fix it to handle these errors gracefully. Reported-by: Sandro Mathys <sm@sandro-mathys.ch> Acked-by: Igor Mammedov <niallain@gmail.com> CC: Stable <stable@kernel.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
-
Authored by Josef Bacik
We are racy with async block caching and unpinning extents. This patch makes things much less complicated by only unpinning the extent if the block group is cached. We check the block_group->cached var under the block_group->lock spin lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters, unpin the extent and add the free space back. If it is not set to this, we start the caching of the block group so the next time we try to unpin extents we can actually unpin them. This keeps us from racing with the async caching threads, lets us kill the fs-wide async thread counter, and keeps us from having to set DELALLOC bits for every extent we hit if there are caching kthreads going. One thing that needed to be changed was btrfs_free_super_mirror_extents. Now instead of just looking for LOCKED extents, we also look for DIRTY extents, since we could have left some extents pinned in the previous transaction that will never get freed now that we are unmounting, which would cause us to leak memory. So btrfs_free_super_mirror_extents has been changed to btrfs_free_pinned_extents, and it will clear the extents locked for the super mirror, and any remaining pinned extents that may be present. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
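In outline, the unpin path described above becomes (a simplified sketch of the logic, not the verbatim diff):

	spin_lock(&cache->lock);
	if (cache->cached == BTRFS_CACHE_FINISHED) {
		/* free-space caching is complete: account the space and
		 * hand it straight back to the allocator */
		cache->pinned -= len;
		cache->space_info->bytes_pinned -= len;
		spin_unlock(&cache->lock);
		btrfs_add_free_space(cache, start, len);
	} else {
		/* not cached yet: kick off caching so a later unpin of
		 * this range can actually return the space */
		spin_unlock(&cache->lock);
		cache_block_group(cache);
	}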
-
Authored by Julia Lawall
dir has already been tested. It seems that this test should instead be on the recently returned value, inode. A simplified version of the semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
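The shape of the fix is to error-check the value that was just returned rather than the already-validated dir — a sketch with a hypothetical lookup helper, not the patched function:

	inode = lookup_child_inode(dir, name);	/* hypothetical helper */
	/* Before: if (IS_ERR(dir)) ... -- but dir was already known good.
	 * After: test the freshly returned pointer instead. */
	if (IS_ERR(inode))
		return ERR_CAST(inode);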
-
- 25 Jul 2009, 4 commits
-
-
Authored by Chris Mason
Allocating a new block group is easy when the disk has plenty of space. But things get difficult as the disk fills up, especially if the FS has been run through btrfs-vol -b. The balance operation is likely to make the total bytes available on the device greater than the largest extent we'll actually be able to allocate. But the device extent allocation code incorrectly assumes that a device with 5G free will be able to allocate a 5G extent. It isn't normally a problem, because device extents don't get freed unless btrfs-vol -b is run. This fixes the device extent allocator to remember the largest free extent it can find, and then use that value as a fallback. Signed-off-by: Chris Mason <chris.mason@oracle.com>
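The idea reduces to remembering the best hole seen while scanning the device, so the caller has a realistic fallback size instead of the raw free-byte total — an illustrative, self-contained sketch rather than the patched allocator:

	struct dev_hole { u64 start; u64 len; };

	/* Return the start of a hole that fits num_bytes, or 0 and report
	 * the largest hole actually available via *max_avail. */
	static u64 pick_dev_extent(const struct dev_hole *holes, int nr,
				   u64 num_bytes, u64 *max_avail)
	{
		u64 largest = 0;
		int i;

		for (i = 0; i < nr; i++) {
			if (holes[i].len >= num_bytes)
				return holes[i].start;	/* fits, use it */
			if (holes[i].len > largest)
				largest = holes[i].len;
		}
		*max_avail = largest;	/* fallback for the next, smaller try */
		return 0;
	}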
-
Authored by Chris Mason
Btrfs allocates individual extents from block groups, and each block group has a specific type. It may hold metadata, mirrored or striped data, etc. When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r) we free block groups. Once a block group is freed, the space it was using on the device may be available for use by new block groups. btrfs_remove_block_group was clearing the flag that said 'our devices are full, don't even try to allocate new block groups', but it was only clearing that flag for a specific type of block group. This commit clears the full flag for all of the types of block groups, making it much more likely that we'll be able to balance space when the drive is close to full. Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Sage Weil
The commit_transaction call to wait_ordered_extents when snap_pending passes nocow_only=1 to process only NOCOW or PREALLOC extents. This isn't correct for the 'flushoncommit' mode, as it skips extents we just started IO on in start_delalloc_inodes. So, in the flushoncommit case, wait on all ordered extents. Otherwise, only pass the nocow_only flag to wait_ordered_extents if snap_pending. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Yan Zheng
btrfs_split_leaf and btrfs_del_items can end up in a loop where one is constantly splitting a given leaf and the other is constantly merging it back with the adjacent nodes. There is a better fix for this, but in the interest of something small, this patch just changes btrfs_del_items back to balancing less often. Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
- 24 Jul 2009, 5 commits
-
-
Authored by Yan Zheng
Check the objectid of the item before checking the item type; otherwise we may return zero for a key that is actually too low. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Yan Zheng
find_free_dev_extent does not properly handle the case where the device is not completely free and there is a free extent at the beginning of the device. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Diego Calleja
comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just call it. Signed-off-by: Diego Calleja <diegocg@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
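In outline, the on-disk-key comparison can just convert and delegate — a sketch of the described cleanup, not the literal diff:

	static int comp_keys(struct btrfs_disk_key *disk, struct btrfs_key *k2)
	{
		struct btrfs_key k1;

		/* convert the little-endian on-disk key, then reuse the
		 * existing cpu-key comparison instead of duplicating it */
		btrfs_disk_key_to_cpu(&k1, disk);
		return btrfs_comp_cpu_keys(&k1, k2);
	}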
-
Authored by Josef Bacik
This patch moves the caching of the block group off to a kthread in order to allow people to allocate sooner. Instead of blocking up behind the caching mutex, we instead kick off the caching kthread, and then attempt to make an allocation. If we cannot, we wait on the block group's caching waitqueue, which the caching kthread will wake the waiting threads up on every time it finds 2 meg worth of space, and then again when it's finished caching. This is how I tested the speedup from this:

  mkfs the disk
  mount the disk
  fill the disk up with fs_mark
  unmount the disk
  mount the disk
  time touch /mnt/foo

Without my changes this took 11 seconds on my box; with these changes it now takes 1 second. Another change that's been put in place is that we lock the super mirrors in the pinned extent map in order to keep us from adding that stuff as free space when caching the block group. This doesn't really change anything else as far as the pinned extent map is concerned, since for actual pinned extents we use EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock those extents to keep from leaking memory. I've also added a check where, when we are reading block groups from disk, if the amount of space used == the size of the block group, we go ahead and mark the block group as cached. This drastically reduces the amount of time it takes to cache the block groups. Using the same test as above, except doing a dd to a file and then unmounting, it used to take 33 seconds to umount; now it takes 3 seconds. This version uses the commit_root in the caching kthread, and then keeps track of how many async caching threads are running at any given time, so if one of the async threads is still running as we cross transactions we can wait until it's finished before handling the pinned extents. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Authored by Josef Bacik
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply tracking free space. As free space gets fragmented, we end up with thousands of entries on an rb-tree per block group, which usually spans 1 gig of area. Since we currently don't ever flush the free space cache back to disk, this gets to be a bit unwieldy on large fs's with lots of fragmentation. This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free space cache. Initially we calculate a threshold of extent entries we can handle, which is however many extent entries we can cram into 16k of RAM. The maximum amount of RAM that should ever be used to track 1 gigabyte of disk space will be 32k of RAM, which scales much better than we did before. Once we pass the extent threshold, we start adding bitmaps and using those instead for tracking the free space. This patch also makes it so that any free space that's less than 4 * sectorsize is put straight into a bitmap. This is nice since we try to allocate out of the front of a block group, so if the front of a block group is heavily fragmented and then has a huge chunk of free space at the end, we go ahead and add the fragmented areas to bitmaps and use a normal extent entry to track the big chunk at the back of the block group. I've also taken the opportunity to revamp how we search for free space. Previously we indexed free space via an offset-indexed rb tree and a bytes-indexed rb tree. I've dropped the bytes-indexed rb tree and use only the offset-indexed rb tree. This cuts the number of tree operations we were doing previously down by half, and gives us a little bit of a better allocation pattern, since we will always start from a specific offset and search forward from there, instead of searching for the size we need and trying to get it as close as possible to the offset we want. I've given this a healthy amount of testing, pre-new-format stuff as well as post-new-format stuff. I've booted up my fedora box, which is installed on btrfs with this patch, and run with it for a few days without issues. I've not seen any performance regressions in any of my tests. Since the last patch Yan Zheng fixed a problem where we could have overlapping entries, so updating their offset inline would cause problems. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
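The extent-entry threshold described above can be sketched as follows — how many plain extent entries fit in the 16k budget before the cache switches new free space over to PAGE_SIZE bitmaps (the field names here are illustrative, not necessarily those in the patch):

	#define EXTENT_ENTRY_BUDGET	(16 * 1024)	/* RAM budget per block group */

	static void recalc_extent_thresh(struct btrfs_block_group_cache *cache)
	{
		/* Below this many rb-tree entries we keep one entry per free
		 * extent; beyond it, additional free space is recorded in
		 * PAGE_SIZE bitmaps, bounding RAM use at roughly 32k per
		 * gigabyte of disk. */
		cache->extents_thresh =
			EXTENT_ENTRY_BUDGET / sizeof(struct btrfs_free_space);
	}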
-