- 06 1月, 2009 2 次提交
-
-
由 Yan Zheng 提交于
Snapshot creation happens at a specific time during transaction commit. We need to make sure the code called by snapshot creation doesn't wait for the running transaction to commit. This changes btrfs_delete_inode and finish_pending_snaps to use btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks. It would be better if btrfs_delete_inode didn't use the join, but the call path that triggers it is: btrfs_commit_transaction->create_pending_snapshots-> create_pending_snapshot->btrfs_lookup_dentry-> fixup_tree_root_location->btrfs_read_fs_root-> btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput This will be fixed in a later patch by moving the orphan cleanup to the cleaner thread. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 18 12月, 2008 1 次提交
-
-
由 Chris Mason 提交于
bio_end_io for reads without checksumming on and btree writes were happening without using async thread pools. This means the extent_io.c code had to use spin_lock_irq and friends on the rb tree locks for extent state. There were some irq safe vs unsafe lock inversions between the delallock lock and the extent state locks. This patch gets rid of them by moving all end_io code into the thread pools. To avoid contention and deadlocks between the data end_io processing and the metadata end_io processing yet another thread pool is added to finish off metadata writes. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 16 12月, 2008 2 次提交
-
-
由 Chris Mason 提交于
The delalloc lock doesn't need to have irqs disabled, nobody that changes the number of delalloc bytes in the FS is running with irqs off. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The compression code was using isize to limit the amount of data it sent through zlib. But, it wasn't properly limiting the looping to just the pages inside i_size. The end result was trying to compress too many pages, including those that had not been setup and properly locked down. This made the compression code oops while trying find_get_page on a page that didn't exist. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 12 12月, 2008 2 次提交
-
-
由 Yan Zheng 提交于
Checksums on data can be disabled by mount option, so it's possible some data extents don't have checksums or have invalid checksums. This causes trouble for data relocation. This patch contains following things to make data relocation work. 1) make nodatasum/nodatacow mount option only affects new files. Checksums and COW on data are only controlled by the inode flags. 2) check the existence of checksum in the nodatacow checker. If checksums exist, force COW the data extent. This ensure that checksum for a given block is either valid or does not exist. 3) update data relocation code to properly handle the case of checksum missing. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 09 12月, 2008 2 次提交
-
-
由 Chris Mason 提交于
This adds a sequence number to the btrfs inode that is increased on every update. NFS will be able to use that to detect when an inode has changed, without relying on inaccurate time fields. While we're here, this also: Puts reserved space into the super block and inode Adds a log root transid to the super so we can pick the newest super based on the fsync log as well as the main transaction ID. For now the log root transid is always zero, but that'll get fixed. Adds a starting offset to the dev_item. This will let us do better alignment calculations if we know the start of a partition on the disk. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Btrfs stores checksums for each data block. Until now, they have been stored in the subvolume trees, indexed by the inode that is referencing the data block. This means that when we read the inode, we've probably read in at least some checksums as well. But, this has a few problems: * The checksums are indexed by logical offset in the file. When compression is on, this means we have to do the expensive checksumming on the uncompressed data. It would be faster if we could checksum the compressed data instead. * If we implement encryption, we'll be checksumming the plain text and storing that on disk. This is significantly less secure. * For either compression or encryption, we have to get the plain text back before we can verify the checksum as correct. This makes the raid layer balancing and extent moving much more expensive. * It makes the front end caching code more complex, as we have touch the subvolume and inodes as we cache extents. * There is potentitally one copy of the checksum in each subvolume referencing an extent. The solution used here is to store the extent checksums in a dedicated tree. This allows us to index the checksums by phyiscal extent start and length. It means: * The checksum is against the data stored on disk, after any compression or encryption is done. * The checksum is stored in a central location, and can be verified without following back references, or reading inodes. This makes compression significantly faster by reducing the amount of data that needs to be checksummed. It will also allow much faster raid management code in general. The checksums are indexed by a key with a fixed objectid (a magic value in ctree.h) and offset set to the starting byte of the extent. This allows us to copy the checksum items into the fsync log tree directly (or any other tree), without having to invent a second format for them. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 02 12月, 2008 3 次提交
-
-
由 Chris Mason 提交于
Snapshot and subvolume creation no longer need this helper. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Christoph Hellwig 提交于
Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Liu Hui 提交于
The file preallocation code reversed the logic to force nodatacow. This fixes it.
-
- 20 11月, 2008 3 次提交
-
-
由 Chris Mason 提交于
The btrfs git kernel trees is used to build a standalone tree for compiling against older kernels. This commit makes the standalone tree work with 2.6.27 Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
This fixes compile problems with linux-next Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
While building large bios in writepages, btrfs may end up waiting for other page writeback to finish if WB_SYNC_ALL is used. While it is waiting, the bio it is building has a number of pages with the writeback bit set and they aren't getting to the disk any time soon. This lowers the latencies of writeback in general by sending down the bio being built before waiting for other pages. The bio submission code tries to limit the total number of async bios in flight by waiting when we're over a certain number of async bios. But, the waits are happening while writepages is building bios, and this can easily lead to stalls and other problems for people calling wait_on_page_writeback. The current fix is to let the congestion tests take care of waiting. sync() and others make sure to drain the current async requests to make sure that everything that was pending when the sync was started really get to disk. The code would drain pending requests both before and after submitting a new request. But, if one of the requests is waiting for page writeback to finish, the draining waits might block that page writeback. This changes the draining code to only wait after submitting the bio being processed. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 18 11月, 2008 3 次提交
-
-
由 Chris Mason 提交于
Subvols and snapshots can now be referenced from any point in the directory tree. We need to maintain back refs for them so we can find lost subvols. Forward refs are added so that we know all of the subvols and snapshots referenced anywhere in the directory tree of a single subvol. This can be used to do recursive snapshotting (but they aren't yet) and it is also used to detect and prevent directory loops when creating new snapshots. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Each subvolume has its own private inode number space, and so we need to fill in different device numbers for each subvolume to avoid confusing applications. This commit puts a struct super_block into struct btrfs_root so it can call set_anon_super() and get a different device number generated for each root. btrfs_rename is changed to prevent renames across subvols. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Before, all snapshots and subvolumes lived in a single flat directory. This was awkward and confusing because the single flat directory was only writable with the ioctls. This commit changes the ioctls to create subvols and snapshots at any point in the directory tree. This requires making separate ioctls for snapshot and subvol creation instead of a combining them into one. The subvol ioctl does: btrfsctl -S subvol_name parent_dir After the ioctl is done subvol_name lives inside parent_dir. The snapshot ioctl does: btrfsctl -s path_for_snapshot root_to_snapshot path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up into directory and basename components. root_to_snapshot can be any file or directory in the FS. The snapshot is taken of the entire root where that file lives. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 13 11月, 2008 1 次提交
-
-
由 Yan Zheng 提交于
This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 11 11月, 2008 2 次提交
-
-
由 Chris Mason 提交于
Simple casting here and there to fix things up. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
This makes sure the orig_start field in struct extent_map gets set everywhere the extent_map structs are created or modified. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 10 11月, 2008 1 次提交
-
-
由 Yan Zheng 提交于
The decompress code doesn't take the logical offset in extent pointer into account. If the logical offset isn't zero, data will be decompressed into wrong pages. The solution used here is to record the starting offset of the extent in the file separately from the logical start of the extent_map struct. This allows us to avoid problems inserting overlapping extents. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 07 11月, 2008 2 次提交
-
-
由 Chris Mason 提交于
When reading compressed extents, try to put pages into the page cache for any pages covered by the compressed extent that readpages didn't already preload. Add an async work queue to handle transformations at delayed allocation processing time. Right now this is just compression. The workflow is: 1) Find offsets in the file marked for delayed allocation 2) Lock the pages 3) Lock the state bits 4) Call the async delalloc code The async delalloc code clears the state lock bits and delalloc bits. It is important this happens before the range goes into the work queue because otherwise it might deadlock with other work queue items that try to lock those extent bits. The file pages are compressed, and if the compression doesn't work the pages are written back directly. An ordered work queue is used to make sure the inodes are written in the same order that pdflush or writepages sent them down. This changes extent_write_cache_pages to let the writepage function update the wbc nr_written count. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Btrfs uses kernel threads to create async work queues for cpu intensive operations such as checksumming and decompression. These work well, but they make it difficult to keep IO order intact. A single writepages call from pdflush or fsync will turn into a number of bios, and each bio is checksummed in parallel. Once the checksum is computed, the bio is sent down to the disk, and since we don't control the order in which the parallel operations happen, they might go down to the disk in almost any order. The code deals with this somewhat by having deep work queues for a single kernel thread, making it very likely that a single thread will process all the bios for a single inode. This patch introduces an explicitly ordered work queue. As work structs are placed into the queue they are put onto the tail of a list. They have three callbacks: ->func (cpu intensive processing here) ->ordered_func (order sensitive processing here) ->ordered_free (free the work struct, all processing is done) The work struct has three callbacks. The func callback does the cpu intensive work, and when it completes the work struct is marked as done. Every time a work struct completes, the list is checked to see if the head is marked as done. If so the ordered_func callback is used to do the order sensitive processing and the ordered_free callback is used to do any cleanup. Then we loop back and check the head of the list again. This patch also changes the checksumming code to use the ordered workqueues. One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 01 11月, 2008 1 次提交
-
-
由 Chris Mason 提交于
Make sure we keep page->mapping NULL on the pages we're getting via alloc_page. It gets set so a few of the callbacks can do the right thing, but in general these pages don't have a mapping. Don't try to truncate compressed inline items in btrfs_drop_extents. The whole compressed item must be preserved. Don't try to create multipage inline compressed items. When we try to overwrite just the first page of the file, we would have to read in and recow all the pages after it in the same compressed inline items. For now, only create single page inline items. Make sure we lock pages in the correct order during delalloc. The search into the state tree for delalloc bytes can return bytes before the page we already have locked. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 31 10月, 2008 6 次提交
-
-
由 Yan Zheng 提交于
This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
This patch simplifies the nodatacow checker. If all references were created after the latest snapshot, then we can avoid COW safely. This patch also updates run_delalloc_nocow to do more fine-grained checking. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
When dropping middle part of an extent, btrfs_drop_extents truncates the extent at first, then inserts a bookend extent. Since truncation and insertion can't be done atomically, there is a small period that the bookend extent isn't in the tree. This causes problem for functions that search the tree for file extent item. The way to fix this is lock the range of the bookend extent before truncation. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
This patch splits the hole insertion code out of btrfs_setattr into btrfs_cont_expand and updates btrfs_get_extent to properly handle the case that file extent items are not continuous. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Chris Mason 提交于
When compression was on, we were improperly ignoring -o nodatasum. This reworks the logic a bit to properly honor all the flags. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The byte walk counting was awkward and error prone. This uses the number of pages sent the higher layer to build bios. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 30 10月, 2008 1 次提交
-
-
由 Chris Mason 提交于
This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 10 10月, 2008 1 次提交
-
-
由 Christoph Hellwig 提交于
Creating a subvolume is in many ways like a normal VFS ->mkdir, and we really need to play with the VFS topology locking rules. So instead of just creating the snapshot on disk and then later getting rid of confliting aliases do it correctly from the start. This will become especially important once we allow for subvolumes anywhere in the tree, and not just below a hidden root. Note that snapshots will need the same treatment, but do to the delay in creating them we can't do it currently. Chris promised to fix that issue, so I'll wait on that. Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 09 10月, 2008 3 次提交
-
-
由 Yan Zheng 提交于
Due to the optimization for truncate, tree leaves only containing checksum items can be deleted without being COW'ed first. This causes reference cache misses. The way to fix the miss is create cache entries for tree leaves only contain checksum. This patch also fixes a -EEXIST issue in shared reference cache. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
The offset field in struct btrfs_extent_ref records the position inside file that file extent is referenced by. In the new back reference system, tree leaves holding references to file extent are recorded explicitly. We can scan these tree leaves very quickly, so the offset field is not required. This patch also makes the back reference system check the objectid when extents are in deleting. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
This patch makes btrfs count space allocated to file in bytes instead of 512 byte sectors. Everything else in btrfs uses a byte count instead of sector sizes or blocks sizes, so this fits better. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 04 10月, 2008 1 次提交
-
-
由 Chris Mason 提交于
On 32 bit machines without CONFIG_LBD, the bi_sector field is only 32 bits. Btrfs needs to cast it before shifting up, or we end up doing IO into the wrong place. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 02 10月, 2008 1 次提交
-
-
由 Chris Mason 提交于
Checksum items take up a significant portion of the metadata for large files. It is possible to avoid reading them during truncates by checking the keys in the higher level nodes. If a given leaf is followed by another leaf where the lowest key is a checksum item from the same file, we know we can safely delete the leaf without reading it. For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete the file with a cold cache. It is read bound during the run. With this change, Btrfs is able to delete the file in 0.5s Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 30 9月, 2008 1 次提交
-
-
由 Chris Mason 提交于
This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work todo in cleaning and commenting the code. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 29 9月, 2008 1 次提交
-
-
由 Chris Mason 提交于
btrfs-vol -a /dev/xxx will zero the first and last two MB of the device. The kernel code needs to wait for this IO to finish before it adds the device. btrfs metadata IO does not happen through the block device inode. A separate address space is used, allowing the zero filled buffer heads in the block device inode to be written to disk after FS metadata starts going down to the disk via the btrfs metadata inode. The end result is zero filled metadata blocks after adding new devices into the filesystem. The fix is a simple filemap_write_and_wait on the block device inode before actually inserting it into the pool of available devices. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-