- 04 2月, 2009 4 次提交
-
-
由 Chris Ball 提交于
Before this patch, new files/dirs would ignore the SGID bit on their parent directory and always be owned by the creating user's uid/gid. Signed-off-by: NChris Ball <cjb@laptop.org> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Every transaction in btrfs creates a new snapshot, and then schedules the snapshot from the last transaction for deletion. Snapshot deletion works by walking down the btree and dropping the reference counts on each btree block during the walk. If if a given leaf or node has a reference count greater than one, the reference count is decremented and the subtree pointed to by that node is ignored. If the reference count is one, walking continues down into that node or leaf, and the references of everything it points to are decremented. The old code would try to work in small pieces, walking down the tree until it found the lowest leaf or node to free and then returning. This was very friendly to the rest of the FS because it didn't have a huge impact on other operations. But it wouldn't always keep up with the rate that new commits added new snapshots for deletion, and it wasn't very optimal for the extent allocation tree because it wasn't finding leaves that were close together on disk and processing them at the same time. This changes things to walk down to a level 1 node and then process it in bulk. All the leaf pointers are sorted and the leaves are dropped in order based on their extent number. The extent allocation tree and commit code are now fast enough for this kind of bulk processing to work without slowing the rest of the FS down. Overall it does less IO and is better able to keep up with snapshot deletions under high load. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Most of the btrfs metadata operations can be protected by a spinlock, but some operations still need to schedule. So far, btrfs has been using a mutex along with a trylock loop, most of the time it is able to avoid going for the full mutex, so the trylock loop is a big performance gain. This commit is step one for getting rid of the blocking locks entirely. btrfs_tree_lock takes a spinlock, and the code explicitly switches to a blocking lock when it starts an operation that can schedule. We'll be able get rid of the blocking locks in smaller pieces over time. Tracing allows us to find the most common cause of blocking, so we can start with the hot spots first. The basic idea is: btrfs_tree_lock() returns with the spin lock held btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in the extent buffer flags, and then drops the spin lock. The buffer is still considered locked by all of the btrfs code. If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops the spin lock and waits on a wait queue for the blocking bit to go away. Much of the code that needs to set the blocking bit finishes without actually blocking a good percentage of the time. So, an adaptive spin is still used against the blocking bit to avoid very high context switch rates. btrfs_clear_lock_blocking() clears the blocking bit and returns with the spinlock held again. btrfs_tree_unlock() can be called on either blocking or spinning locks, it does the right thing based on the blocking bit. ctree.c has a helper function to set/clear all the locked buffers in a path as blocking. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Jim Owens 提交于
Add call to LSM security initialization and save resulting security xattr for new inodes. Add xattr support to symlink inode ops. Set inode->i_op for existing special files. Signed-off-by: Njim owens <jowens@hp.com>
-
- 29 1月, 2009 1 次提交
-
-
由 Chris Mason 提交于
After btrfs_readdir has gone through all the directory items, it sets the directory f_pos to the largest possible int. This way applications that mix readdir with creating new files don't end up in an endless loop finding the new directory items as they go. It was a workaround for a bug in git, but the assumption was that if git could make this looping mistake than it would be a common problem. The largest possible int chosen was INT_LIMIT(typeof(file->f_pos), and it is possible for that to be a larger number than 32 bit glibc expects to come out of readdir. This patches switches that to INT_LIMIT(off_t), which should keep applications happy on 32 and 64 bit machines. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 22 1月, 2009 2 次提交
-
-
由 Yehuda Sadeh 提交于
Now that bmap support is gone, this is the only way to get extent mappings for userland. These are still not valid for IO, but they can tell us if a file has holes or how much fragmentation there is. Signed-off-by: NYehuda Sadeh <yehuda@hq.newdream.net>
-
由 Chris Mason 提交于
Swapfiles use bmap to build a list of extents belonging to the file, and they assume these extents won't change over the life of the file. They also use resulting list to do IO directly to the block device. This causes problems for btrfs in a few ways: btrfs returns logical block numbers through bmap, and these are not suitable for IO. They might translate to different devices, raid etc. COW means that file block mappings are going to change frequently. Using swapfiles on btrfs will lead to corruption, so we're avoiding the problem for now by dropping bmap support entirely. A later commit will add fiemap support for people that really want to know how a file is laid out. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 21 1月, 2009 2 次提交
-
-
由 Qinghuang Feng 提交于
Merge list_for_each* and list_entry to list_for_each_entry* Signed-off-by: NQinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Huang Weiyi 提交于
Removed unused #include <version.h>'s in btrfs Signed-off-by: NHuang Weiyi <weiyi.huang@gmail.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 07 1月, 2009 3 次提交
-
-
由 Chris Mason 提交于
None of the checksum verification code schedules, so we can use the faster kmap_atomic Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Checksum verification happens in a helper thread, and there is no need to mess with interrupts. This switches to kmap() instead. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
This patch contains following things. 1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This struct is kmalloced so we want to keep it reasonable. 2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was duplicated code in tree-log.c 3) Remove replay_one_csum. csum items are replayed at the same time as replaying file extents. This guarantees we only replay useful csums. 4) nbytes accounting fix. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 06 1月, 2009 2 次提交
-
-
由 Yan Zheng 提交于
Snapshot creation happens at a specific time during transaction commit. We need to make sure the code called by snapshot creation doesn't wait for the running transaction to commit. This changes btrfs_delete_inode and finish_pending_snaps to use btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks. It would be better if btrfs_delete_inode didn't use the join, but the call path that triggers it is: btrfs_commit_transaction->create_pending_snapshots-> create_pending_snapshot->btrfs_lookup_dentry-> fixup_tree_root_location->btrfs_read_fs_root-> btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput This will be fixed in a later patch by moving the orphan cleanup to the cleaner thread. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 18 12月, 2008 1 次提交
-
-
由 Chris Mason 提交于
bio_end_io for reads without checksumming on and btree writes were happening without using async thread pools. This means the extent_io.c code had to use spin_lock_irq and friends on the rb tree locks for extent state. There were some irq safe vs unsafe lock inversions between the delallock lock and the extent state locks. This patch gets rid of them by moving all end_io code into the thread pools. To avoid contention and deadlocks between the data end_io processing and the metadata end_io processing yet another thread pool is added to finish off metadata writes. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 16 12月, 2008 2 次提交
-
-
由 Chris Mason 提交于
The delalloc lock doesn't need to have irqs disabled, nobody that changes the number of delalloc bytes in the FS is running with irqs off. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The compression code was using isize to limit the amount of data it sent through zlib. But, it wasn't properly limiting the looping to just the pages inside i_size. The end result was trying to compress too many pages, including those that had not been setup and properly locked down. This made the compression code oops while trying find_get_page on a page that didn't exist. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 12 12月, 2008 2 次提交
-
-
由 Yan Zheng 提交于
Checksums on data can be disabled by mount option, so it's possible some data extents don't have checksums or have invalid checksums. This causes trouble for data relocation. This patch contains following things to make data relocation work. 1) make nodatasum/nodatacow mount option only affects new files. Checksums and COW on data are only controlled by the inode flags. 2) check the existence of checksum in the nodatacow checker. If checksums exist, force COW the data extent. This ensure that checksum for a given block is either valid or does not exist. 3) update data relocation code to properly handle the case of checksum missing. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 09 12月, 2008 2 次提交
-
-
由 Chris Mason 提交于
This adds a sequence number to the btrfs inode that is increased on every update. NFS will be able to use that to detect when an inode has changed, without relying on inaccurate time fields. While we're here, this also: Puts reserved space into the super block and inode Adds a log root transid to the super so we can pick the newest super based on the fsync log as well as the main transaction ID. For now the log root transid is always zero, but that'll get fixed. Adds a starting offset to the dev_item. This will let us do better alignment calculations if we know the start of a partition on the disk. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 02 12月, 2008 3 次提交
-
-
由 Chris Mason 提交于
Snapshot and subvolume creation no longer need this helper. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Christoph Hellwig 提交于
Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Liu Hui 提交于
The file preallocation code reversed the logic to force nodatacow. This fixes it.
-
- 20 11月, 2008 3 次提交
-
-
由 Chris Mason 提交于
The btrfs git kernel trees is used to build a standalone tree for compiling against older kernels. This commit makes the standalone tree work with 2.6.27 Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
This fixes compile problems with linux-next Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
While building large bios in writepages, btrfs may end up waiting for other page writeback to finish if WB_SYNC_ALL is used. While it is waiting, the bio it is building has a number of pages with the writeback bit set and they aren't getting to the disk any time soon. This lowers the latencies of writeback in general by sending down the bio being built before waiting for other pages. The bio submission code tries to limit the total number of async bios in flight by waiting when we're over a certain number of async bios. But, the waits are happening while writepages is building bios, and this can easily lead to stalls and other problems for people calling wait_on_page_writeback. The current fix is to let the congestion tests take care of waiting. sync() and others make sure to drain the current async requests to make sure that everything that was pending when the sync was started really get to disk. The code would drain pending requests both before and after submitting a new request. But, if one of the requests is waiting for page writeback to finish, the draining waits might block that page writeback. This changes the draining code to only wait after submitting the bio being processed. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 18 11月, 2008 3 次提交
-
-
由 Chris Mason 提交于
Subvols and snapshots can now be referenced from any point in the directory tree. We need to maintain back refs for them so we can find lost subvols. Forward refs are added so that we know all of the subvols and snapshots referenced anywhere in the directory tree of a single subvol. This can be used to do recursive snapshotting (but they aren't yet) and it is also used to detect and prevent directory loops when creating new snapshots. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Each subvolume has its own private inode number space, and so we need to fill in different device numbers for each subvolume to avoid confusing applications. This commit puts a struct super_block into struct btrfs_root so it can call set_anon_super() and get a different device number generated for each root. btrfs_rename is changed to prevent renames across subvols. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Before, all snapshots and subvolumes lived in a single flat directory. This was awkward and confusing because the single flat directory was only writable with the ioctls. This commit changes the ioctls to create subvols and snapshots at any point in the directory tree. This requires making separate ioctls for snapshot and subvol creation instead of a combining them into one. The subvol ioctl does: btrfsctl -S subvol_name parent_dir After the ioctl is done subvol_name lives inside parent_dir. The snapshot ioctl does: btrfsctl -s path_for_snapshot root_to_snapshot path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up into directory and basename components. root_to_snapshot can be any file or directory in the FS. The snapshot is taken of the entire root where that file lives. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 13 11月, 2008 1 次提交
-
-
由 Yan Zheng 提交于
This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 11 11月, 2008 2 次提交
-
-
由 Chris Mason 提交于
Simple casting here and there to fix things up. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
This makes sure the orig_start field in struct extent_map gets set everywhere the extent_map structs are created or modified. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 10 11月, 2008 1 次提交
-
-
由 Yan Zheng 提交于
The decompress code doesn't take the logical offset in extent pointer into account. If the logical offset isn't zero, data will be decompressed into wrong pages. The solution used here is to record the starting offset of the extent in the file separately from the logical start of the extent_map struct. This allows us to avoid problems inserting overlapping extents. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 07 11月, 2008 2 次提交
-
-
由 Chris Mason 提交于
When reading compressed extents, try to put pages into the page cache for any pages covered by the compressed extent that readpages didn't already preload. Add an async work queue to handle transformations at delayed allocation processing time. Right now this is just compression. The workflow is: 1) Find offsets in the file marked for delayed allocation 2) Lock the pages 3) Lock the state bits 4) Call the async delalloc code The async delalloc code clears the state lock bits and delalloc bits. It is important this happens before the range goes into the work queue because otherwise it might deadlock with other work queue items that try to lock those extent bits. The file pages are compressed, and if the compression doesn't work the pages are written back directly. An ordered work queue is used to make sure the inodes are written in the same order that pdflush or writepages sent them down. This changes extent_write_cache_pages to let the writepage function update the wbc nr_written count. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Btrfs uses kernel threads to create async work queues for cpu intensive operations such as checksumming and decompression. These work well, but they make it difficult to keep IO order intact. A single writepages call from pdflush or fsync will turn into a number of bios, and each bio is checksummed in parallel. Once the checksum is computed, the bio is sent down to the disk, and since we don't control the order in which the parallel operations happen, they might go down to the disk in almost any order. The code deals with this somewhat by having deep work queues for a single kernel thread, making it very likely that a single thread will process all the bios for a single inode. This patch introduces an explicitly ordered work queue. As work structs are placed into the queue they are put onto the tail of a list. They have three callbacks: ->func (cpu intensive processing here) ->ordered_func (order sensitive processing here) ->ordered_free (free the work struct, all processing is done) The work struct has three callbacks. The func callback does the cpu intensive work, and when it completes the work struct is marked as done. Every time a work struct completes, the list is checked to see if the head is marked as done. If so the ordered_func callback is used to do the order sensitive processing and the ordered_free callback is used to do any cleanup. Then we loop back and check the head of the list again. This patch also changes the checksumming code to use the ordered workqueues. One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 01 11月, 2008 1 次提交
-
-
由 Chris Mason 提交于
Make sure we keep page->mapping NULL on the pages we're getting via alloc_page. It gets set so a few of the callbacks can do the right thing, but in general these pages don't have a mapping. Don't try to truncate compressed inline items in btrfs_drop_extents. The whole compressed item must be preserved. Don't try to create multipage inline compressed items. When we try to overwrite just the first page of the file, we would have to read in and recow all the pages after it in the same compressed inline items. For now, only create single page inline items. Make sure we lock pages in the correct order during delalloc. The search into the state tree for delalloc bytes can return bytes before the page we already have locked. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 31 10月, 2008 3 次提交
-
-
由 Yan Zheng 提交于
This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
This patch simplifies the nodatacow checker. If all references were created after the latest snapshot, then we can avoid COW safely. This patch also updates run_delalloc_nocow to do more fine-grained checking. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
When dropping middle part of an extent, btrfs_drop_extents truncates the extent at first, then inserts a bookend extent. Since truncation and insertion can't be done atomically, there is a small period that the bookend extent isn't in the tree. This causes problem for functions that search the tree for file extent item. The way to fix this is lock the range of the bookend extent before truncation. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-