Commits · 8c087b5183adab186a298f2d6ed39aefdcae413c · openeuler / Kernel

04 Feb, 2009 4 commits

Btrfs: Handle SGID bit when creating inodes · 8c087b51

Committed by Chris Ball on Feb 04, 2009

Before this patch, new files/dirs would ignore the SGID bit on their
parent directory and always be owned by the creating user's uid/gid.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

8c087b51

Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks · bd56b302

Committed by Chris Mason on Feb 04, 2009

Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion.  Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.

If a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.

If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.

The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning.  This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.

But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.

This changes things to walk down to a level 1 node and then process it
in bulk.  All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.

The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down.  Overall it does less IO and is better able to keep up with
snapshot deletions under high load.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

bd56b302
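
The bulk step described above (collect the leaf pointers under a level 1 node, sort them, then drop them in disk order) can be sketched in user space as follows. `struct leaf_ref` and `drop_leaves_in_bulk()` are hypothetical names for illustration; the real code works on extent buffers and the extent allocation tree.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one leaf pointer found in a level 1 node. */
struct leaf_ref {
    uint64_t bytenr; /* extent number (disk location) of the leaf */
    uint32_t refs;   /* current reference count of the leaf */
};

static int cmp_leaf_ref(const void *a, const void *b)
{
    const struct leaf_ref *x = a;
    const struct leaf_ref *y = b;

    if (x->bytenr < y->bytenr)
        return -1;
    if (x->bytenr > y->bytenr)
        return 1;
    return 0;
}

/*
 * Sort the leaves by extent number and drop one reference on each, in
 * that order. Returns how many leaves reached zero and can be freed.
 */
static size_t drop_leaves_in_bulk(struct leaf_ref *leaves, size_t n)
{
    size_t freed = 0;

    assert(leaves != NULL);
    qsort(leaves, n, sizeof(*leaves), cmp_leaf_ref);
    for (size_t i = 0; i < n; i++) {
        assert(leaves[i].refs > 0);
        if (--leaves[i].refs == 0)
            freed++;
    }
    return freed;
}
```

Sorting first means neighbouring leaves on disk are handled back to back, which is the locality benefit the commit message describes for the extent allocation tree.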

Btrfs: Change btree locking to use explicit blocking points · b4ce94de

Committed by Chris Mason on Feb 04, 2009

Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock.  The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time.  So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks;
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b4ce94de
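
The lock states listed in this commit can be followed with a single-threaded model. This is only a sketch of the state transitions, not the kernel code: there is no real spinning, waiting or wakeup here, `struct sketch_eb` and the `sketch_*` functions are hypothetical, and `sketch_tree_lock()` simply returns false where the real lock would spin or sleep on the wait queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two lock-related states of an extent buffer. */
struct sketch_eb {
    bool spin_held; /* models the spinlock being held */
    bool blocking;  /* models the EXTENT_BUFFER_BLOCKING bit */
};

/* Returns true with the "spinlock" held, false if the caller must wait. */
static bool sketch_tree_lock(struct sketch_eb *eb)
{
    if (eb->spin_held)
        return false;          /* real code would spin here */
    eb->spin_held = true;
    if (eb->blocking) {
        /* Owner is in a blocking section: drop the spinlock and wait. */
        eb->spin_held = false;
        return false;
    }
    return true;
}

/* Switch a held spinning lock into a blocking lock. */
static void sketch_set_lock_blocking(struct sketch_eb *eb)
{
    assert(eb->spin_held && !eb->blocking);
    eb->blocking = true;
    eb->spin_held = false;     /* buffer still counts as locked */
}

/* Switch back: clear the bit and return with the spinlock held. */
static void sketch_clear_lock_blocking(struct sketch_eb *eb)
{
    assert(eb->blocking && !eb->spin_held);
    eb->spin_held = true;
    eb->blocking = false;
}

/* Unlock works for either state, chosen by the blocking bit. */
static void sketch_tree_unlock(struct sketch_eb *eb)
{
    if (eb->blocking) {
        eb->blocking = false;  /* real code would wake waiters */
    } else {
        assert(eb->spin_held);
        eb->spin_held = false;
    }
}
```

A typical sequence is lock, set blocking before an operation that may schedule, clear blocking afterwards, then unlock; a second locker that arrives while the bit is set gets false from the model.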

Btrfs: selinux support · 0279b4cd

Committed by Jim Owens on Feb 04, 2009

Add call to LSM security initialization and save
resulting security xattr for new inodes.

Add xattr support to symlink inode ops.

Set inode->i_op for existing special files.
Signed-off-by: jim owens <jowens@hp.com>

0279b4cd

29 Jan, 2009 1 commit

Btrfs: fix readdir on 32 bit machines · 89f135d8

Committed by Chris Mason on Jan 28, 2009

After btrfs_readdir has gone through all the directory items, it
sets the directory f_pos to the largest possible int.  This way
applications that mix readdir with creating new files don't
end up in an endless loop finding the new directory items as they go.

It was a workaround for a bug in git, but the assumption was that if git
could make this looping mistake then it would be a common problem.

The largest possible int chosen was INT_LIMIT(typeof(file->f_pos)),
and it is possible for that to be a larger number than 32 bit glibc
expects to come out of readdir.

This patch switches that to INT_LIMIT(off_t), which should keep
applications happy on 32 and 64 bit machines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

89f135d8
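
To see why the choice of type matters, the sketch below computes the largest positive value of a signed type, in the spirit of the kernel's INT_LIMIT. `SKETCH_INT_LIMIT` is a hypothetical user-space macro written with unsigned arithmetic, and `int32_t`/`int64_t` stand in for a 32 bit off_t and the 64 bit f_pos; these stand-ins are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest positive value of a signed integer type, e.g. 2^31 - 1. */
#define SKETCH_INT_LIMIT(type) \
    ((type)((UINT64_C(1) << (sizeof(type) * 8 - 1)) - 1))

static_assert(SKETCH_INT_LIMIT(int32_t) == INT32_MAX, "32 bit limit");
static_assert(SKETCH_INT_LIMIT(int64_t) == INT64_MAX, "64 bit limit");

/* Would this directory position survive a 32 bit off_t in userland? */
static bool fits_in_32bit_off_t(int64_t pos)
{
    return pos >= 0 && pos <= SKETCH_INT_LIMIT(int32_t);
}
```

An end-of-directory marker taken from the 64 bit limit does not pass `fits_in_32bit_off_t()`, while one taken from the 32 bit limit does, which matches the problem and the fix described in the commit.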

22 Jan, 2009 2 commits

Btrfs: fiemap support · 1506fcc8

Committed by Yehuda Sadeh on Jan 21, 2009

Now that bmap support is gone, this is the only way to get extent
mappings for userland. These are still not valid for IO, but they
can tell us if a file has holes or how much fragmentation there is.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>

1506fcc8

Btrfs: stop providing a bmap operation to avoid swapfile corruptions · 35054394

Committed by Chris Mason on Jan 21, 2009

Swapfiles use bmap to build a list of extents belonging to the file,
and they assume these extents won't change over the life of the file.
They also use the resulting list to do IO directly to the block device.

This causes problems for btrfs in a few ways:

btrfs returns logical block numbers through bmap, and these are not suitable
for IO.  They might translate to different devices, raid etc.

COW means that file block mappings are going to change frequently.

Using swapfiles on btrfs will lead to corruption, so we're avoiding the
problem for now by dropping bmap support entirely.  A later commit
will add fiemap support for people that really want to know how
a file is laid out.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

35054394

21 Jan, 2009 2 commits

Btrfs: simplify iteration codes · c6e30871

Committed by Qinghuang Feng on Jan 21, 2009

Merge list_for_each* and list_entry to list_for_each_entry*
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

c6e30871
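
The before/after pattern of this cleanup can be shown with a user-space imitation of the kernel list macros. The `sketch_*` macros below are hypothetical simplifications: unlike the kernel's list_for_each_entry, the entry macro takes the type explicitly instead of using typeof, so that it stays in standard C.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Old style: iterate over the raw list nodes. */
#define sketch_list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

/* New style: iterate directly over the containing structures. */
#define sketch_list_for_each_entry(pos, head, type, member) \
    for (pos = sketch_container_of((head)->next, type, member); \
         &pos->member != (head); \
         pos = sketch_container_of(pos->member.next, type, member))

struct item {
    int value;
    struct list_head link;
};

static void sketch_list_add_tail(struct list_head *node,
                                 struct list_head *head)
{
    assert(node && head);
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Before: list_for_each plus a separate list_entry style lookup. */
static int sum_old_style(struct list_head *head)
{
    struct list_head *cur;
    int sum = 0;

    sketch_list_for_each(cur, head) {
        struct item *it = sketch_container_of(cur, struct item, link);
        sum += it->value;
    }
    return sum;
}

/* After: one macro does both steps. */
static int sum_new_style(struct list_head *head)
{
    struct item *it;
    int sum = 0;

    sketch_list_for_each_entry(it, head, struct item, link)
        sum += it->value;
    return sum;
}
```

Both functions visit the same entries in the same order; the second form just removes the intermediate cursor variable.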

Btrfs: removed unused #include <version.h>'s · 7eaebe7d

Committed by Huang Weiyi on Jan 21, 2009

Removed unused #include <version.h>'s in btrfs
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7eaebe7d

07 Jan, 2009 3 commits

Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook · 9ab86c8e

Committed by Chris Mason on Jan 07, 2009

None of the checksum verification code schedules, so we can use the faster
kmap_atomic.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

9ab86c8e

Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies · cc7172de

Committed by Chris Mason on Jan 06, 2009

Checksum verification happens in a helper thread, and there is no
need to mess with interrupts.  This switches to kmap() instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

cc7172de

Btrfs: tree logging checksum fixes · 07d400a6

Committed by Yan Zheng on Jan 06, 2009

This patch contains the following changes.

1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE.  This
struct is kmalloced so we want to keep it reasonable.

2) Replace copy_extent_csums by btrfs_lookup_csums_range.  This was
duplicated code in tree-log.c

3) Remove replay_one_csum. csum items are replayed at the same time as
   replaying file extents. This guarantees we only replay useful csums.

4) nbytes accounting fix.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

07d400a6

06 Jan, 2009 2 commits

Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation · 180591bc

Committed by Yan Zheng on Jan 06, 2009

Snapshot creation happens at a specific time during transaction commit.  We
need to make sure the code called by snapshot creation doesn't wait
for the running transaction to commit.

This changes btrfs_delete_inode and finish_pending_snaps to use
btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks.

It would be better if btrfs_delete_inode didn't use the join, but the
call path that triggers it is:

btrfs_commit_transaction->create_pending_snapshots->
create_pending_snapshot->btrfs_lookup_dentry->
fixup_tree_root_location->btrfs_read_fs_root->
btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput

This will be fixed in a later patch by moving the orphan cleanup to the
cleaner thread.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

180591bc

Btrfs: Fix checkpatch.pl warnings · d397712b

Committed by Chris Mason on Jan 05, 2009

There were many; most are fixed now.  struct-funcs.c generates some warnings
but these are bogus.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d397712b

18 Dec, 2008 1 commit

Btrfs: shift all end_io work to thread pools · cad321ad

Committed by Chris Mason on Dec 17, 2008

bio_end_io for reads without checksumming on and btree writes were
happening without using async thread pools.  This means the extent_io.c
code had to use spin_lock_irq and friends on the rb tree locks for
extent state.

There were some irq safe vs unsafe lock inversions between the delalloc
lock and the extent state locks.  This patch gets rid of them by moving
all end_io code into the thread pools.

To avoid contention and deadlocks between the data end_io processing and the
metadata end_io processing, yet another thread pool is added to finish
off metadata writes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

cad321ad

16 Dec, 2008 2 commits

Btrfs: Don't use spin*lock_irq for the delalloc lock · 75eff68e

Committed by Chris Mason on Dec 15, 2008

The delalloc lock doesn't need to have irqs disabled; nobody that
changes the number of delalloc bytes in the FS is running with irqs off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

75eff68e

Btrfs: Fix compressed writes on truncated pages · 42dc7bab

Committed by Chris Mason on Dec 15, 2008

The compression code was using isize to limit the amount of data it
sent through zlib.  But, it wasn't properly limiting the looping to
just the pages inside i_size.  The end result was trying to compress
too many pages, including those that had not been setup and properly locked
down.  This made the compression code oops while trying find_get_page on a
page that didn't exist.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

42dc7bab

12 Dec, 2008 2 commits

Btrfs: fix nodatasum handling in balancing code · 17d217fe

Committed by Yan Zheng on Dec 12, 2008

Checksums on data can be disabled by mount option, so it's
possible some data extents don't have checksums or have
invalid checksums. This causes trouble for data relocation.
This patch contains the following changes to make data relocation
work.

1) make the nodatasum/nodatacow mount options only affect new
files. Checksums and COW on data are only controlled by the
inode flags.

2) check the existence of checksum in the nodatacow checker.
If checksums exist, force COW the data extent. This ensures that
the checksum for a given block is either valid or does not exist.

3) update data relocation code to properly handle the case
of a missing checksum.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

17d217fe

Btrfs: fix leaking block group on balance · d2fb3437

Committed by Yan Zheng on Dec 11, 2008

The block group structs are referenced in many different
places, and it's not safe to free while balancing.  So, those block
group structs were simply leaked instead.

This patch replaces the block group pointer in the inode with the starting byte
offset of the block group and adds reference counting to the block group
struct.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

d2fb3437

09 Dec, 2008 2 commits

Btrfs: Add inode sequence number for NFS and reserved space in a few structs · c3027eb5

Committed by Chris Mason on Dec 08, 2008

This adds a sequence number to the btrfs inode that is increased on
every update.  NFS will be able to use that to detect when an inode has
changed, without relying on inaccurate time fields.

While we're here, this also:

Puts reserved space into the super block and inode

Adds a log root transid to the super so we can pick the newest super
based on the fsync log as well as the main transaction ID.  For now
the log root transid is always zero, but that'll get fixed.

Adds a starting offset to the dev_item.  This will let us do better
alignment calculations if we know the start of a partition on the disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

c3027eb5

Btrfs: move data checksumming into a dedicated tree · d20f7043

Committed by Chris Mason on Dec 08, 2008

Btrfs stores checksums for each data block.  Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block.  This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file.  When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data.  It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk.  This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct.  This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.

* There is potentially one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree.  This allows us to index the checksums by physical extent
start and length.  It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed.  It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent.  This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d20f7043

02 Dec, 2008 3 commits

Btrfs: delete unused function: btrfs_invalidate_dcache_root · 4022abf4

Committed by Chris Mason on Dec 02, 2008

Snapshot and subvolume creation no longer need this helper.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4022abf4

Btrfs: make things static and include the right headers · b2950863

Committed by Christoph Hellwig on Dec 02, 2008

Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.
Signed-off-by: Christoph Hellwig <hch@lst.de>

b2950863

Btrfs: Fix cow semantic in run_delalloc_nocow() · ce397c06

Committed by Liu Hui on Dec 01, 2008

The file preallocation code reversed the logic to force nodatacow.
This fixes it.

ce397c06

20 Nov, 2008 3 commits

Btrfs: compat code fixes · 4b4e25f2

Committed by Chris Mason on Nov 20, 2008

The btrfs git kernel tree is used to build a standalone tree for
compiling against older kernels.  This commit makes the standalone tree
work with 2.6.27.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4b4e25f2

Btrfs: Use current_fsuid/gid · 79683f2d

Committed by Chris Mason on Nov 19, 2008

This fixes compile problems with linux-next.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

79683f2d

Btrfs: Avoid writeback stalls · d2c3f4f6

Committed by Chris Mason on Nov 19, 2008

While building large bios in writepages, btrfs may end up waiting
for other page writeback to finish if WB_SYNC_ALL is used.

While it is waiting, the bio it is building has a number of pages with the
writeback bit set and they aren't getting to the disk any time soon. This
lowers the latencies of writeback in general by sending down the bio being
built before waiting for other pages.

The bio submission code tries to limit the total number of async bios in
flight by waiting when we're over a certain number of async bios. But,
the waits are happening while writepages is building bios, and this can easily
lead to stalls and other problems for people calling wait_on_page_writeback.

The current fix is to let the congestion tests take care of waiting.

sync() and others make sure to drain the current async requests to make
sure that everything that was pending when the sync was started really gets
to disk. The code would drain pending requests both before and after
submitting a new request.

But, if one of the requests is waiting for page writeback to finish,
the draining waits might block that page writeback. This changes the
draining code to only wait after submitting the bio being processed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

d2c3f4f6

18 Nov, 2008 3 commits

Btrfs: Add backrefs and forward refs for subvols and snapshots · 0660b5af

Committed by Chris Mason on Nov 17, 2008

Subvols and snapshots can now be referenced from any point in the directory
tree.  We need to maintain back refs for them so we can find lost
subvols.

Forward refs are added so that we know all of the subvols and
snapshots referenced anywhere in the directory tree of a single subvol.  This
can be used to do recursive snapshotting (but they aren't yet) and it is
also used to detect and prevent directory loops when creating new snapshots.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

0660b5af

Btrfs: Give each subvol and snapshot their own anonymous devid · 3394e160

Committed by Chris Mason on Nov 17, 2008

Each subvolume has its own private inode number space, and so we need
to fill in different device numbers for each subvolume to avoid confusing
applications.

This commit puts a struct super_block into struct btrfs_root so it can
call set_anon_super() and get a different device number generated for
each root.

btrfs_rename is changed to prevent renames across subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3394e160

Btrfs: Allow subvolumes and snapshots anywhere in the directory tree · 3de4586c

Committed by Chris Mason on Nov 17, 2008

Before, all snapshots and subvolumes lived in a single flat directory.  This
was awkward and confusing because the single flat directory was only writable
with the ioctls.

This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree.  This requires making separate ioctls for
snapshot and subvol creation instead of combining them into one.

The subvol ioctl does:

btrfsctl -S subvol_name parent_dir

After the ioctl is done subvol_name lives inside parent_dir.

The snapshot ioctl does:

btrfsctl -s path_for_snapshot root_to_snapshot

path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
into directory and basename components.

root_to_snapshot can be any file or directory in the FS.  The snapshot
is taken of the entire root where that file lives.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

3de4586c
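
The commit message says btrfsctl breaks path_for_snapshot into directory and basename components. The sketch below shows one way to do such a split with the POSIX dirname() and basename() functions; it is not btrfsctl's actual code, and `split_snapshot_path()` and `SKETCH_PATH_MAX` are hypothetical names.

```c
#include <assert.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_PATH_MAX 4096

/*
 * Split `path` into its directory part and final component.
 * dirname() and basename() may modify their argument, so each call
 * works on a fresh copy. Returns 0 on success, -1 if a buffer is too small.
 */
static int split_snapshot_path(const char *path, char *dir, char *name,
                               size_t len)
{
    char tmp[SKETCH_PATH_MAX];
    int n;

    assert(path && dir && name);
    if (strlen(path) >= sizeof(tmp))
        return -1;

    memcpy(tmp, path, strlen(path) + 1);
    n = snprintf(dir, len, "%s", dirname(tmp));
    if (n < 0 || (size_t)n >= len)
        return -1;

    memcpy(tmp, path, strlen(path) + 1);
    n = snprintf(name, len, "%s", basename(tmp));
    if (n < 0 || (size_t)n >= len)
        return -1;

    return 0;
}
```

A relative path with no slash yields "." as the directory part, so both absolute and relative snapshot paths resolve to a parent directory plus a name.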

13 Nov, 2008 1 commit

Btrfs: mount ro and remount support · c146afad

Committed by Yan Zheng on Nov 12, 2008

This patch adds mount ro and remount support. The main
changes in this patch are: adding btrfs_remount and a related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating the
allocator to properly handle read only block groups.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

c146afad

11 Nov, 2008 2 commits

Btrfs: Fix compile warnings on 32 bit machines · 5b050f04

Committed by Chris Mason on Nov 11, 2008

Simple casting here and there to fix things up.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

5b050f04

Btrfs: Fix usage of struct extent_map->orig_start · 445a6944

Committed by Chris Mason on Nov 10, 2008

This makes sure the orig_start field in struct extent_map gets set
everywhere the extent_map structs are created or modified.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

445a6944

10 Nov, 2008 1 commit

Btrfs: Fix csum error for compressed data · ff5b7ee3

Committed by Yan Zheng on Nov 10, 2008

The decompress code doesn't take the logical offset in the extent
pointer into account. If the logical offset isn't zero, data
will be decompressed into the wrong pages.

The solution used here is to record the starting offset of the extent
in the file separately from the logical start of the extent_map struct.
This allows us to avoid problems inserting overlapping extents.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

ff5b7ee3

07 Nov, 2008 2 commits

Btrfs: Optimize compressed writeback and reads · 771ed689

Committed by Chris Mason on Nov 06, 2008

When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.

Add an async work queue to handle transformations at delayed allocation processing
time.  Right now this is just compression.  The workflow is:

1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code

The async delalloc code clears the state lock bits and delalloc bits.  It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.

The file pages are compressed, and if the compression doesn't work the
pages are written back directly.

An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.

This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

771ed689

Btrfs: Add ordered async work queues · 4a69a410

Committed by Chris Mason on Nov 06, 2008

Btrfs uses kernel threads to create async work queues for cpu intensive
operations such as checksumming and decompression.  These work well,
but they make it difficult to keep IO order intact.

A single writepages call from pdflush or fsync will turn into a number
of bios, and each bio is checksummed in parallel.  Once the checksum is
computed, the bio is sent down to the disk, and since we don't control
the order in which the parallel operations happen, they might go down to
the disk in almost any order.

The code deals with this somewhat by having deep work queues for a single
kernel thread, making it very likely that a single thread will process all
the bios for a single inode.

This patch introduces an explicitly ordered work queue.  As work structs
are placed into the queue they are put onto the tail of a list.  They have
three callbacks:

->func (cpu intensive processing here)
->ordered_func (order sensitive processing here)
->ordered_free (free the work struct, all processing is done)

The func callback does the cpu intensive work, and when it completes
the work struct is marked as done.

Every time a work struct completes, the list is checked to see if the head
is marked as done.  If so the ordered_func callback is used to do the
order sensitive processing and the ordered_free callback is used to do
any cleanup.  Then we loop back and check the head of the list again.

This patch also changes the checksumming code to use the ordered workqueues.
On a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4a69a410
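
The ordering rule described in this commit can be modelled without threads: work items finish their cpu intensive part in any order, but the order sensitive part only runs for items at the head of the list that are marked done. The sketch below is a single-threaded model under that assumption; all `sketch_*` names are hypothetical and the real code uses kernel threads and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_WORK 16

struct sketch_work {
    int id;
    bool done;
    void (*func)(struct sketch_work *);         /* cpu intensive part */
    void (*ordered_func)(struct sketch_work *); /* order sensitive part */
    void (*ordered_free)(struct sketch_work *); /* cleanup */
};

struct sketch_queue {
    struct sketch_work *items[SKETCH_MAX_WORK];
    size_t head, tail;
};

/* Records the order in which ordered_func ran, for inspection. */
static int sketch_log[SKETCH_MAX_WORK];
static size_t sketch_log_len;

static void sketch_noop(struct sketch_work *w) { (void)w; }

static void sketch_record(struct sketch_work *w)
{
    sketch_log[sketch_log_len++] = w->id;
}

/* New work goes onto the tail of the list. */
static void sketch_queue_work(struct sketch_queue *q, struct sketch_work *w)
{
    assert(q->tail < SKETCH_MAX_WORK);
    w->done = false;
    q->items[q->tail++] = w;
}

/* Run the ordered callbacks for every finished item at the head. */
static void sketch_run_ordered(struct sketch_queue *q)
{
    while (q->head < q->tail && q->items[q->head]->done) {
        struct sketch_work *w = q->items[q->head++];

        w->ordered_func(w);
        w->ordered_free(w);
    }
}

/* What a worker does for one item, in whatever order it picks them up. */
static void sketch_process(struct sketch_queue *q, struct sketch_work *w)
{
    w->func(w);
    w->done = true;
    sketch_run_ordered(q);
}
```

If three items are queued as 0, 1, 2 and processed as 2, 0, 1, nothing is recorded after item 2 finishes, item 0 is recorded once it finishes, and items 1 and 2 are recorded together when item 1 finishes, so the ordered callbacks still run in queue order.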

01 Nov, 2008 1 commit

Btrfs: Compression corner fixes · 70b99e69

Committed by Chris Mason on Oct 31, 2008

Make sure we keep page->mapping NULL on the pages we're getting
via alloc_page.  It gets set so a few of the callbacks can do the right
thing, but in general these pages don't have a mapping.

Don't try to truncate compressed inline items in btrfs_drop_extents.
The whole compressed item must be preserved.

Don't try to create multipage inline compressed items.  When we try to
overwrite just the first page of the file, we would have to read in and recow
all the pages after it in the same compressed inline items.  For now, only
create single page inline items.

Make sure we lock pages in the correct order during delalloc.  The
search into the state tree for delalloc bytes can return bytes before
the page we already have locked.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

70b99e69

31 Oct, 2008 3 commits

Btrfs: Add fallocate support v2 · d899e052

Committed by Yan Zheng on Oct 30, 2008

This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it.  This leverages the
-o nodatacow checks.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

d899e052

Btrfs: update nodatacow code v2 · 80ff3856

Committed by Yan Zheng on Oct 30, 2008

This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

80ff3856

Btrfs: Fix bookend extent race v2 · 6643558d

Committed by Yan Zheng on Oct 30, 2008

When dropping the middle part of an extent, btrfs_drop_extents first
truncates the extent, then inserts a bookend extent.

Since truncation and insertion can't be done atomically, there is a short
window during which the bookend extent isn't in the tree. This causes problems
for functions that search the tree for file extent items. The way to fix this
is to lock the range of the bookend extent before truncation.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>

6643558d

openeuler / Kernel synced successfully about 1 year ago
