提交 · 13a8a7c8c47e542b3cdb45bec3f431f96af79361 · openanolis / cloud-kernel

22 9月, 2009 1 次提交

Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8

由 Yan, Zheng 提交于 9月 21, 2009

The new back reference format does not allow reusing objectid of
deleted snapshot/subvol. So we use ++highest_objectid to allocate
objectid for new snapshot/subvol.

Now we use ++highest_objectid to allocate objectid for both new inode
and new snapshot/subvolume, so this patch removes 'find hole' code in
btrfs_find_free_objectid.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

13a8a7c8

19 9月, 2009 1 次提交

Btrfs: search for an allocation hint while filling file COW · b917b7c3

由 Chris Mason 提交于 9月 18, 2009

The allocator has some nice knobs for sending hints about where
to try and allocate new blocks, but when we're doing file allocations
we're not sending any hint at all.

This commit adds a simple extent map search to see if we can
quickly and easily find a hint for the allocator.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

b917b7c3

12 9月, 2009 7 次提交

Btrfs: zero page past end of inline file items · 93c82d57

由 Chris Mason 提交于 9月 11, 2009

When btrfs_get_extent is reading inline file items for readpage,
it needs to copy the inline extent into the page.  If the
inline extent doesn't cover all of the page, that means there
is a hole in the file, or that our file is smaller than one
page.

readpage does zeroing for the case where the file is smaller than one
page, but nobody is currently zeroing for the case where there is
a hole after the inline item.

This commit changes btrfs_get_extent to zero fill the page past
the end of the inline item.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

93c82d57

Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214

由 Chris Mason 提交于 9月 11, 2009

This closes a whole where the page may be written before
the page_mkwrite caller has a chance to dirty it

(thanks to Nick Piggin)
Signed-off-by: NChris Mason <chris.mason@oracle.com>

50a9b214

Btrfs: Fix extent replacment race · a1ed835e

由 Chris Mason 提交于 9月 11, 2009

Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

a1ed835e

Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b

由 Chris Mason 提交于 9月 02, 2009

Btrfs writes go through delalloc to the data=ordered code.  This
makes sure that all of the data is on disk before the metadata
that references it.  The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.

This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.

One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page become
dirty without going through the proper path.

These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.

This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.

As IO finishes, the PagePrivate2 bit is cleared and the ordered
accoutning is updated for each page.

At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

8b62b72b

Btrfs: use a cached state for extent state operations during delalloc · 9655d298

由 Chris Mason 提交于 9月 02, 2009

This changes the btrfs code to find delalloc ranges in the extent state
tree to use the new state caching code from set/test bit.  It reduces
one of the biggest causes of rbtree searches in the writeback path.

test_range_bit is also modified to take the cached state as a starting
point while searching.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

9655d298

Btrfs: cache values for locking extents · 2c64c53d

由 Chris Mason 提交于 9月 02, 2009

Many of the btrfs extent state tree users follow the same pattern.
They lock an extent range in the tree, do some operation and then
unlock.

This translates to at least 2 rbtree searches, and maybe more if they
are doing operations on the extent state tree.  A locked extent
in the tree isn't going to be merged or changed, and so we can
safely return the extent state structure as a cached handle.

This changes set_extent_bit to give back a cached handle, and also
changes both set_extent_bit and clear_extent_bit to use the cached
handle if it is available.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

2c64c53d

Btrfs: switch extent_map to a rw lock · 890871be

由 Chris Mason 提交于 9月 02, 2009

There are two main users of the extent_map tree.  The
first is regular file inodes, where it is evenly spread
between readers and writers.

The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.

The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

890871be

21 8月, 2009 1 次提交

btrfs: fix inode rbtree corruption · 03e860bd

由 From: Nick Piggin 提交于 8月 21, 2009

Node may not be inserted over existing node. This causes inode tree
corruption and I was seeing crashes in inode_tree_del which I can not
reproduce after this patch.

The other way to fix this would be to tie inode lifetime in the rbtree
with inode while not in freeing state. I had a look at this but it is
not so trivial at this point. At least this patch gets things working again.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Acked-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

03e860bd

08 8月, 2009 1 次提交

Btrfs: remove superfluous NULL pointer check in btrfs_rename() · 4baf8c92

由 Bartlomiej Zolnierkiewicz 提交于 8月 07, 2009

This takes care of the following entry from Dan's list:

fs/btrfs/inode.c +4788 btrfs_rename(36) warning: variable derefenced before check 'old_inode'
Reported-by: NDan Carpenter <error27@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

4baf8c92

23 7月, 2009 1 次提交

Btrfs: adjust NULL test · 33c17ad5

由 Julia Lawall 提交于 7月 22, 2009

Move the call to BUG_ON to before the dereference of the tested value.
Signed-off-by: NJulia Lawall <julia@diku.dk>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

33c17ad5

13 7月, 2009 1 次提交

headers: smp_lock.h redux · 405f5571

由 Alexey Dobriyan 提交于 7月 11, 2009

* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
  It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

  This will make hardirq.h inclusion cheaper for every PREEMPT=n config
  (which includes allmodconfig/allyesconfig, BTW)
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

405f5571

03 7月, 2009 2 次提交

Btrfs: honor nodatacow/sum mount options for new files · 94272164

由 Chris Mason 提交于 7月 02, 2009

The btrfs attr patches unconditionally inherited the inode flags field
without honoring nodatacow and nodatasum. This fix makes sure
we properly record the nodatacow/sum mount options in new inodes.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

94272164

Btrfs: account for space we may use in fallocate · a970b0a1

由 Josef Bacik 提交于 6月 27, 2009

Using Eric Sandeen's xfstest for fallocate, you can easily trigger a ENOSPC
panic on btrfs. This is because we do not account for data we may use when
doing the fallocate. This patch fixes the problem by properly reserving space,
and then just freeing it when we are done. The reservation stuff was made with
delalloc in mind, so its a little crude for this case, but it keeps the box
from panicing.
Signed-off-by: NJosef Bacik <jbacik@redhat.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

a970b0a1

25 6月, 2009 1 次提交
- A
  Get "no acls for this inode" right, fix shmem breakage · 72c04902
  由 Al Viro 提交于 6月 24, 2009
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  72c04902
24 6月, 2009 1 次提交
- A
  switch btrfs to inode->i_acl · 5affd88a
  由 Al Viro 提交于 6月 08, 2009
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  5affd88a
12 6月, 2009 1 次提交
- C
  btrfs: remove ->write_super and stop maintaining ->s_dirt · 59d697b7
  由 Christoph Hellwig 提交于 4月 27, 2009
```
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  59d697b7
10 6月, 2009 2 次提交

Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION · 6cbff00f

由 Christoph Hellwig 提交于 4月 17, 2009

Add support for the standard attributes set via chattr and read via
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

6cbff00f

Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE) · 5d4f98a2

由 Yan Zheng 提交于 6月 10, 2009

This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

5d4f98a2

15 5月, 2009 2 次提交

Btrfs: remove some WARN_ONs in the IO failure path · cc7b0c9b

由 Chris Mason 提交于 5月 14, 2009

These debugging WARN_ONs make too much console noise during regular
IO failures. An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

cc7b0c9b

Btrfs: init inode ordered_data_close flag properly · 2757495c

由 Chris Mason 提交于 5月 14, 2009

This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits.  It was not being properly set to zero when the inode was
being setup.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

2757495c

28 4月, 2009 2 次提交

Btrfs: look for acls during btrfs_read_locked_inode · 46a53cca

由 Chris Mason 提交于 4月 27, 2009

This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
If it is certain a given inode has no acls, it will set the in memory acl
fields to null to avoid acl lookups completely.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

46a53cca

Btrfs: fix acl caching · 7b1a14bb

由 Chris Mason 提交于 4月 27, 2009

Linus noticed the btrfs code to cache acls wasn't properly caching
a NULL acl when the inode didn't have any acls.  This meant the common
case of no acls resulted in expensive btree searches every time the
kernel checked permissions (which is quite often).

This is a modified version of Linus' original patch:

Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
This forces an acl lookup when permission checks are done.

Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
are set to null.

Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
when deciding to cache a NULL acl.  It was storing a NULL acl when
__btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning
-ENODATA for this case.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

7b1a14bb

27 4月, 2009 2 次提交

C
Btrfs: remove unused btrfs_bit_radix slab · 45c06543
由 Chris Mason 提交于 4月 27, 2009
```
Signed-off-by: NChris Mason <chris.mason@oracle.com>
```
45c06543

Btrfs: ratelimit IO error printks · 193f284d

由 Chris Mason 提交于 4月 27, 2009

Btrfs has printks for various IO errors, including bad checksums and
mismatches between what we expect the block headers to contain and what
we actually find on the disk.

Longer term we need a real reporting mechanism for this, but for now
printk is going to have to do.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

193f284d

25 4月, 2009 2 次提交

Btrfs: fix fallocate deadlock on inode extent lock · e980b50c

由 Chris Mason 提交于 4月 24, 2009

The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.

The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding.  The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.

It turns out that at least one other caller had the same bug.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

e980b50c

Btrfs: kill btrfs_cache_create · 9601e3f6

由 Christoph Hellwig 提交于 4月 13, 2009

Just use kmem_cache_create directly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

9601e3f6

22 4月, 2009 1 次提交

Btrfs: fix btrfs fallocate oops and deadlock · 546888da

由 Chris Mason 提交于 4月 21, 2009

Btrfs fallocate was incorrectly starting a transaction with a lock held
on the extent_io tree for the file, which could deadlock. Strictly
speaking it was using join_transaction which would be safe, but it is better
to move the transaction outside of the lock.

When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
being called on an unlocked buffer. This was triggering an assertion and
oops because the lock is supposed to be held.

The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
been run. btrfs_del_item takes care of dirtying things, so the solution is a
to skip the btrfs_mark_buffer_dirty call in this case.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

546888da

03 4月, 2009 1 次提交

Btrfs: free inode struct when btrfs_new_inode fails · 09771430

由 Shen Feng 提交于 4月 02, 2009

btrfs_new_inode doesn't call iput to free the inode
when it fails.
Signed-off-by: NShen Feng <shen@cn.fujitsu.com>
Signed-off-by: NChris Mason <chris.mason@oracle.com>

09771430

01 4月, 2009 3 次提交

fs: fix page_mkwrite error cases in core code and btrfs · 56a76f82

由 Nick Piggin 提交于 3月 31, 2009

page_mkwrite is called with neither the page lock nor the ptl held. This
means a page can be concurrently truncated or invalidated out from
underneath it. Callers are supposed to prevent truncate races themselves,
however previously the only thing they can do in case they hit one is to
raise a SIGBUS. A sigbus is wrong for the case that the page has been
invalidated or truncated within i_size (eg. hole punched). Callers may
also have to perform memory allocations in this path, where again, SIGBUS
would be wrong.

The previous patch ("mm: page_mkwrite change prototype to match fault")
made it possible to properly specify errors. Convert the generic buffer.c
code and btrfs to return sane error values (in the case of page removed
from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
without doing anything, and the fault will be retried properly).

This fixes core code, and converts btrfs as a template/example. All other
filesystems defining their own page_mkwrite should be fixed in a similar
manner.
Acked-by: NChris Mason <chris.mason@oracle.com>
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

56a76f82

mm: page_mkwrite change prototype to match fault · c2ec175c

由 Nick Piggin 提交于 3月 31, 2009

Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags.  There should be no functional change.

This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg.  virtual_address to the
driver, which might be important in some special cases).

This is required for a subsequent fix.  And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c2ec175c

Btrfs: add extra flushing for renames and truncates · 5a3f23d5

由 Chris Mason 提交于 3月 31, 2009

Renames and truncates are both common ways to replace old data with new
data. The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.

This is especially important for rename, which many application use as
though it were atomic for both the data and the metadata involved. The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.

If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file. The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done. This is similar to the
ext3 style data=ordering, except it is only done on selected files.

Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which use a sub-commit).

For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file. Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).

For truncates, we order if the file goes from non-zero size down to
zero size. This is a little different, because at the time of the
truncate the file has no dirty bytes to order. But, we flag the inode
so that it is added to the ordered list on close (via release method). We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

5a3f23d5

25 3月, 2009 5 次提交

Btrfs: tree logging unlink/rename fixes · 12fcfd22

由 Chris Mason 提交于 3月 24, 2009

The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.

The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log.  This patch adds a few new rules
to the tree logging.

1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done.  The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.

Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged.  For renames this is only done when
renaming to a different directory.

mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file

The fsync above will unlink the original some_dir without recording
it in its new location (foo2).  After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit

2) we must log any new names for any file or dir that is in the fsync
log.  This way we make sure not to lose files that are unlinked during
the same transaction.

2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.

2a is actually the more important variant.  Without the extra logging
a crash might unlink the old name without recreating the new one

3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf

mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)

The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir.  After a crash the rm -rf must
be replayed.  This must be able to recurse down the entire
directory tree.  The inode link count fixup code takes care of the
ugly details.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

12fcfd22

Btrfs: readahead checksums during btrfs_finish_ordered_io · 5d13a98f

由 Chris Mason 提交于 3月 13, 2009

This reads in blocks in the checksum btree before starting the
transaction in btrfs_finish_ordered_io. It makes it much more likely
we'll be able to do operations inside the transaction without
needing any btree reads, which limits transaction latencies overall.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

5d13a98f

Btrfs: leave btree locks spinning more often · b9473439

由 Chris Mason 提交于 3月 13, 2009

btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

b9473439

Btrfs: reduce stack in cow_file_range · 7f366cfe

由 Chris Mason 提交于 3月 12, 2009

The fs/btrfs/inode.c code to run delayed allocation during writout
needed some stack usage optimization.  This is the first pass, it does
the check for compression earlier on, which allows us to do the common
(no compression) case higher up in the call chain.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

7f366cfe

Btrfs: reduce stalls during transaction commit · b7ec40d7

由 Chris Mason 提交于 3月 12, 2009

To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.

This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait. This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.

It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish. This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

b7ec40d7

21 2月, 2009 1 次提交

Btrfs: add better -ENOSPC handling · 6a63209f

由 Josef Bacik 提交于 2月 20, 2009

This is a step in the direction of better -ENOSPC handling. Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.

If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC. This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.

bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line.

bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point. When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter. This keeps us from
not actually being able to allocate space for any delalloc bytes.
Signed-off-by: NJosef Bacik <jbacik@redhat.com>

6a63209f

13 2月, 2009 1 次提交

Btrfs: remove btrfs_init_path · e00f7308

由 Jeff Mahoney 提交于 2月 12, 2009

btrfs_init_path was initially used when the path objects were on the
stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

e00f7308

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功