- 12 6月, 2009 2 次提交
-
-
由 Christoph Hellwig 提交于
Make sure a superblock really is writeable by checking MS_RDONLY under s_umount. sync_filesystems needed some re-arragement for that, but all but one sync_filesystem caller had the correct locking already so that we could add that check there. cachefiles grew s_umount locking. I've also added a WARN_ON to sync_filesystem to assert this for future callers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 11 6月, 2009 7 次提交
-
-
由 Chris Mason 提交于
During tree log replay, we read in the tree log roots, process them and then free them. A recent change takes an extra reference on the root node of the tree when the root is read in, and stores that reference in root->commit_root. This reference was not being freed, leaving us with one buffer pinned in ram for each subvol with a tree log root after a crash. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
This happens during subvol creation. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
It was printing nodatacsum, which was not the correct option name. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
lookup_inline_extent_backref only checks for duplicate backref for data extents. It assumes backrefs for tree block never conflict. This patch makes lookup_inline_extent_backref check for duplicate backrefs for both data and tree block, so that we can detect potential bug earlier. This is a safety check, strictly speaking it is not required. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Shin Hong 提交于
This patch fixes a bug which may result race condition between btrfs_start_workers() and worker_loop(). btrfs_start_workers() executed in a parent thread writes on workers->worker and worker_loop() in a child thread reads workers->worker. However, there is no synchronization enforcing the order of two operations. This patch makes btrfs_start_workers() fill workers->worker before it starts a child thread with worker_loop() Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Hisashi Hifumi 提交于
write_dev_supers is called in sequence. First is it called with wait == 0, which starts IO on all of the super blocks for a given device. Then it is called with wait == 1 to make sure they all reach the disk. It doesn't currently pin the buffers between the two calls, and it also assumes the buffers won't go away between the two calls, leading to an oops if the VM manages to free the buffers in the middle of the sync. This fixes that assumption and updates the code to return an error if things are not up to date when the wait == 1 run is done. Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
On multi-device filesystems, btrfs writes supers to all of the devices before considering a sync complete. There wasn't any additional locking between super writeout and the device list management code because device management was done inside a transaction and super writeout only happened with no transation writers running. With the btrfs fsync log and other async transaction updates, this has been racey for some time. This adds a mutex to protect the device list. The existing volume mutex could not be reused due to transaction lock ordering requirements. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 10 6月, 2009 16 次提交
-
-
由 Al Viro 提交于
... otherwise generic_permission() will allow *anything* for all files you don't own and that have some group permissions. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Hisashi Hifumi 提交于
In btrfs, fdatasync and fsync are identical, but fdatasync should skip committing transaction when inode->i_state is set just I_DIRTY_SYNC and this indicates only atime or/and mtime updates. Following patch improves fdatasync throughput. --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr --file-fsync-mode=fdatasync run Results: -2.6.30-rc8 Test execution summary: total time: 1980.6540s total number of events: 10001 total time taken by event execution: 1192.9804 per-request statistics: min: 0.0000s avg: 0.1193s max: 15.3720s approx. 95 percentile: 0.7257s Threads fairness: events (avg/stddev): 625.0625/151.32 execution time (avg/stddev): 74.5613/9.46 -2.6.30-rc8-patched Test execution summary: total time: 1695.9118s total number of events: 10000 total time taken by event execution: 871.3214 per-request statistics: min: 0.0000s avg: 0.0871s max: 10.4644s approx. 95 percentile: 0.4787s Threads fairness: events (avg/stddev): 625.0000/131.86 execution time (avg/stddev): 54.4576/8.98 Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 David Woodhouse 提交于
There's no need to preserve this abstraction; it used to let us use hardware crc32c support directly, but libcrc32c is already doing that for us through the crypto API -- so we're already using the Intel crc32c acceleration where appropriate. Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Christoph Hellwig 提交于
Add support for the standard attributes set via chattr and read via lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros as they were confusing the existing code and got in the way of the new additions. Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's trivial. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
During mount, btrfs will check the queue nonrot flag for all the devices found in the FS. If they are all non-rotating, SSD mode is enabled by default. If the FS was mounted with -o nossd, the non-rotating flag is ignored. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Some SSDs perform best when reusing block numbers often, while others perform much better when clustering strictly allocates big chunks of unused space. The default mount -o ssd will find rough groupings of blocks where there are a bunch of free blocks that might have some allocated blocks mixed in. mount -o ssd_spread will make sure there are no allocated blocks mixed in. It should perform better on lower end SSDs. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
In SSD mode for data, and all the time for metadata the allocator will try to find a cluster of nearby blocks for allocations. This commit adds extra checks to make sure that each free block in the cluster is close to the last one. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
This allows you to turn off the ssd mode via remount. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The btrfs IO submission threads try to service a bunch of devices with a small number of threads. They do a congestion check to try and avoid waiting on requests for a busy device. The checks make sure we've sent a few requests down to a given device just so that we aren't bouncing between busy devices without actually sending down any IO. The counter used to decide if we can switch to the next device is somewhat overloaded. It is also being used to decide if we've done a good batch of requests between the WRITE_SYNC or regular priority lists. It may get reset to zero often, leaving us hammering on a busy device instead of moving on to another disk. This commit adds a new counter for the number of bios sent while servicing a device. It doesn't get reset or fiddled with. On multi-device filesystems, this fixes IO stalls in streaming write workloads. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Btrfs uses dedicated threads to submit bios when checksumming is on, which allows us to make sure the threads dedicated to checksumming don't get stuck waiting for requests. For each btrfs device, there are two lists of bios. One list is for WRITE_SYNC bios and the other is for regular priority bios. The IO submission threads used to process all of the WRITE_SYNC bios first and then switch to the regular bios. This commit makes sure we don't completely starve the regular bios by rotating between the two lists. WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries to run in batches to avoid seeking. Benchmarking shows this eliminates stalls during streaming buffered writes on both multi-device and single device filesystems. If the regular bios starve, the system can end up with a large amount of ram pinned down in writeback pages. If we are a little more fair between the two classes, we're able to keep throughput up and make progress on the bulk of our dirty ram. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Once a metadata block has been written, it must be recowed, so the btrfs dirty balancing call has a check to make sure a fair amount of metadata was actually dirty before it started writing it back to disk. A previous commit had changed the dirty tracking for metadata without updating the btrfs dirty balancing checks. This commit switches it to use the correct counter. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The block allocator in SSD mode will try to find groups of free blocks that are close together. This commit makes it loop less on a given group size before bumping it. The end result is that we are less likely to fill small holes in the available free space, but we don't waste as much CPU building the large cluster used by ssd mode. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
With the new back reference code, the cost of a balance has gone down in terms of the number of back reference updates done. This commit makes us more aggressively balance leaves and nodes as they become less full. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
When the delayed reference code was added, some checks were added to avoid extra balancing while the delayed references were being flushed. This made for less efficient btrees, but it reduced the chances of loops where no forward progress was made because the balances made more delayed ref updates. With the new dead root removal code and the mixed back references, the extent allocation tree is no longer using precise back refs, and the delayed reference updates don't carry the risk of looping forever anymore. So, the balance avoidance is no longer required. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
There are some 'start = state->end + 1;' like code in set_extent_bit and clear_extent_bit. They overflow when end == (u64)-1. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 05 6月, 2009 1 次提交
-
-
由 Chris Mason 提交于
The btrfs allocator uses list_for_each to walk the available block groups when searching for free blocks. It starts off with a hint to help find the best block group for a given allocation. The hint is resolved into a block group, but we don't properly check to make sure the block group we find isn't in the middle of being freed due to filesystem shrinking or balancing. If it is being freed, the list pointers in it are bogus and can't be trusted. But, the code happily goes along and uses them in the list_for_each loop, leading to all kinds of fun. The fix used here is to check to make sure the block group we find really is on the list before we use it. list_del_init is used when removing it from the list, so we can do a proper check. The allocation clustering code has a similar bug where it will trust the block group in the current free space cluster. If our allocation flags have changed (going from single spindle dup to raid1 for example) because the drives in the FS have changed, we're not allowed to use the old block group any more. The fix used here is to check the current cluster against the current allocation flags. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 04 6月, 2009 1 次提交
-
-
由 Yan Zheng 提交于
It was not being properly initialized, and so the size saved to disk was not correct. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 15 5月, 2009 6 次提交
-
-
由 Sankar P 提交于
Signed-off-by: NSankar P <sankar.curiosity@gmail.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
The notreelog and flushoncommit mount options were being printed slightly differently. Signed-off-by: NSage Weil <sage@newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Li Hong 提交于
In Li Zefan's commit dae7b665, a combination call of kmalloc() and copy_from_user() is replaced by memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more. Signed-off-by: NLi Hong <lihong.hi@gmail.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
These debugging WARN_ONs make too much console noise during regular IO failures. An IO failure will still generate a number of messages as we verify checksums etc, but these two are not needed. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
When a btrfs metadata read fails, the first thing we try to do is find a good copy on another mirror of the block. If this fails, read_tree_block() ends up returning a buffer that isn't up to date. The btrfs btree reading code was reworked to drop locks and repeat the search when IO was done, but the changes didn't add a check for failed reads. The end result was looping forever on buffers that were never going to become up to date. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
This flag is used to decide when we need to send a given file through the ordered code to make sure it is fully written before a transaction commits. It was not being properly set to zero when the inode was being setup. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 09 5月, 2009 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 4月, 2009 2 次提交
-
-
由 Chris Mason 提交于
This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items. If it is certain a given inode has no acls, it will set the in memory acl fields to null to avoid acl lookups completely. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Linus noticed the btrfs code to cache acls wasn't properly caching a NULL acl when the inode didn't have any acls. This meant the common case of no acls resulted in expensive btree searches every time the kernel checked permissions (which is quite often). This is a modified version of Linus' original patch: Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode. This forces an acl lookup when permission checks are done. Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields are set to null. Fix btrfs_get_acl to use the right return value from __btrfs_getxattr when deciding to cache a NULL acl. It was storing a NULL acl when __btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning -ENODATA for this case. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 27 4月, 2009 4 次提交
-
-
由 Joel Becker 提交于
Just happened to notice a bunch of %llu vs u64 warnings. Here's a patch to cast them all. Signed-off-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Joel Becker 提交于
A small warning popped up on ia64 because inode-map.c was comparing a u64 object id with the ULL FIRST_FREE_OBJECTID. My first thought was that all the OBJECTID constants should contain the u64 cast because btrfs code deals entirely in u64s. But then I saw how large that was, and figured I'd just fix the max() call. Signed-off-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Btrfs has printks for various IO errors, including bad checksums and mismatches between what we expect the block headers to contain and what we actually find on the disk. Longer term we need a real reporting mechanism for this, but for now printk is going to have to do. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-