- 21 5月, 2011 1 次提交
-
-
由 Miao Xie 提交于
Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dave@jikos.cz> Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 15 5月, 2011 3 次提交
-
-
由 Li Zefan 提交于
Steps to reproduce the bug: - Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL - Call FS_IOC_SETFLAGS ioctl with flags=0 - Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set! Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Li Zefan 提交于
As we've added per file compression/cow support. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Li Zefan 提交于
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file COW in btrfs, but FS_NOCOW_FL is sufficient. The fact is we don't have corresponding BTRFS_INODE_COW flag. COW is default, and FS_NOCOW_FL can be used to switch off COW for a single file. If we mount btrfs with nodatacow, a newly created file will be set with the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the FS_NOCOW_FL flag. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 12 4月, 2011 1 次提交
-
-
由 Daniel J Blueman 提交于
Fix address space annotation correct in ioctl.c. Signed-off-by: NDaniel J Blueman <daniel.blueman@gmail.com> BTRFS_BLOCK_GROUP_SYSTEM, @@ -2387,7 +2387,7 @@ long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg) up_read(&info->groups_sem); } - user_dest = (struct btrfs_ioctl_space_info *) + user_dest = (struct btrfs_ioctl_space_info __user *) (arg + sizeof(struct btrfs_ioctl_space_args)); if (copy_to_user(user_dest, dest_orig, alloc_size)) Reviewed-by: NJosef Bacik <josef@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 05 4月, 2011 2 次提交
-
-
由 Li Zefan 提交于
root_item->flags and root_item->byte_limit are not initialized when a subvolume is created. This bug is not revealed until we added readonly snapshot support - now you mount a btrfs filesystem and you may find the subvolumes in it are readonly. To work around this problem, we steal a bit from root_item->inode_item->flags, and use it to indicate if those fields have been properly initialized. When we read a tree root from disk, we check if the bit is set, and if not we'll set the flag and initialize the two fields of the root item. Reported-by: NAndreas Philipp <philipp.andreas@gmail.com> Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Tested-by: NAndreas Philipp <philipp.andreas@gmail.com> cc: stable@kernel.org Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Tsutomu Itoh 提交于
Call btrfs_end_transaction() if btrfs_commit_transaction_async() fails. Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 28 3月, 2011 4 次提交
-
-
由 liubo 提交于
setflags ioctl should return error when any checks fail. Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Li Dongyang 提交于
We take an free extent out from allocator, trim it, then put it back, but before we trim the block group, we should make sure the block group is cached, so plus a little change to make cache_block_group() run without a transaction. Signed-off-by: NLi Dongyang <lidongyang@novell.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Liu Bo 提交于
Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones. According to Chris's comment, there should be just one true compression method(probably LZO) stored in the super. However, before this, we would wait for that one method is stable enough to be adopted into the super. So I list it as a long term goal, and just store it in ram today. After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to control file and directory's datacow and compression attribute. NOTE: - The compression type is selected by such rules: If we mount btrfs with compress options, ie, zlib/lzo, the type is it. Otherwise, we'll use the default compress type (zlib today). v1->v2: - rebase to the latest btrfs. v2->v3: - fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW will be screwed by inheritance from parent directory. Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Tsutomu Itoh 提交于
This patch changes some BUG_ON() to the error return. (but, most callers still use BUG_ON()) Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 24 3月, 2011 1 次提交
-
-
由 Serge E. Hallyn 提交于
And give it a kernel-doc comment. [akpm@linux-foundation.org: btrfs changed in linux-next] Signed-off-by: NSerge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: NDavid Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 3月, 2011 1 次提交
-
-
由 Josef Bacik 提交于
If we cannot truncate an inode for some reason we will never delete the orphan item associated with that inode, which means that we will loop forever in btrfs_orphan_cleanup. Instead of doing this just return error so we fail to mount. It sucks, but hey it's better than hanging. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 17 2月, 2011 1 次提交
-
-
由 Li Zefan 提交于
- Check user-specified flags correctly - Check the inode owership - Search root item in root tree but not fs tree Reported-by: NDan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 15 2月, 2011 1 次提交
-
-
由 Dan Rosenberg 提交于
Commit bf5fc093 refactored btrfs_ioctl_space_info() and introduced several security issues. space_args.space_slots is an unsigned 64-bit type controlled by a possibly unprivileged caller. The comparison as a signed int type allows providing values that are treated as negative and cause the subsequent allocation size calculation to wrap, or be truncated to 0. By providing a size that's truncated to 0, kmalloc() will return ZERO_SIZE_PTR. It's also possible to provide a value smaller than the slot count. The subsequent loop ignores the allocation size when copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR. The fix changes the slot count type and comparison typecast to u64, which prevents truncation or signedness errors, and also ensures that we don't copy more data than we've allocated in the subsequent loop. Note that zero-size allocations are no longer possible since there is already an explicit check for space_args.space_slots being 0 and truncation of this value is no longer an issue. Signed-off-by: NDan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: NJosef Bacik <josef@redhat.com> Reviewed-by: NJosef Bacik <josef@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 01 2月, 2011 1 次提交
-
-
由 Tsutomu Itoh 提交于
The error check of btrfs_start_transaction() is added, and the mistake of the error check on several places is corrected. Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 29 1月, 2011 2 次提交
-
-
由 Tsutomu Itoh 提交于
btrfs_start_ioctl_transaction() returns ERR_PTR(), not NULL. So, it is necessary to use IS_ERR() to check the return value. Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Tsutomu Itoh 提交于
The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock() is added, and the mistake of the error check in several places is corrected. For more stable Btrfs, I think that we should reduce BUG_ON(). But, I think that long time is necessary for this. So, I propose this patch as a short-term solution. With this patch: - To more stable Btrfs, the part that should be corrected is clarified. - The panic isn't done by the NULL pointer reference etc. (even if BUG_ON() is increased temporarily) - The error code is returned in the place where the error can be easily returned. As a long-term plan: - BUG_ON() is reduced by using the forced-readonly framework, etc. Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 27 1月, 2011 1 次提交
-
-
由 Li Zefan 提交于
Suppose: - the source extent is: [0, 100] - the src offset is 10 - the clone length is 90 - the dest offset is 0 This statement: new_key.offset = key.offset + destoff - off will produce such an extent for the dest file: [ino, BTRFS_EXTENT_DATA_KEY, -10] , which is obviously wrong. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
- 23 12月, 2010 3 次提交
-
-
由 Li Zefan 提交于
This allows us to set a snapshot or a subvolume readonly or writable on the fly. Usage: Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then call ioctl(BTRFS_IOCTL_SUBVOL_SETFLAGS); Changelog for v3: - Change to pass __u64 as ioctl parameter. Changelog for v2: - Add _GETFLAGS ioctl. - Check if the passed fd is the root of a subvolume. - Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
由 Li Zefan 提交于
Usage: Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call ioctl(BTRFS_I0CTL_SNAP_CREATE_V2). Implementation: - Set readonly bit of btrfs_root_item->flags. - Add readonly checks in btrfs_permission (inode_permission), btrfs_setattr, btrfs_set/remove_xattr and some ioctls. Changelog for v3: - Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags. - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
由 Li Zefan 提交于
Split it into two functions for two different ioctls, since they share no common code. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
- 22 12月, 2010 2 次提交
-
-
由 Li Zefan 提交于
Update defrag ioctl, so one can choose lzo or zlib when turning on compression in defrag operation. Changelog: v1 -> v2 - Add incompability flag. - Fix to check invalid compress type. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
由 Li Zefan 提交于
Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
- 11 12月, 2010 2 次提交
-
-
由 Li Zefan 提交于
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we wouldn't have to create a new structure for async snapshot creation. Here we convert async snapshot ioctl to use a more generic ABI, as we'll add more ioctls for snapshots/subvolumes in the future, readonly snapshots for example. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
We were incorrectly taking the async path even for the sync ioctls by passing in &transid unconditionally. There's ample room for further cleanup here, but this keeps the fix simple. Signed-off-by: NSage Weil <sage@newdream.net> Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 22 11月, 2010 3 次提交
-
-
由 Josef Bacik 提交于
There are lots of places where we do dentry->d_parent->d_inode without holding the dentry->d_lock. This could cause problems with rename. So instead we need to use dget_parent() and hold the reference to the parent as long as we are going to use it's inode and then dput it at the end. Signed-off-by: NJosef Bacik <josef@redhat.com> Cc: raven@themaw.net Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Li Zefan 提交于
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the original filesize of the dest file 'file2' is 30K: # ls -l /mnt/file2 -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2 Now clone file1 to file2, the dest file should be 40K, but it still shows 30K: # ls -l /mnt/file2 -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2 Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Li Zefan 提交于
We've done the check for src_offset and src_length, and We should also check dest_offset, otherwise we'll corrupt the destination file: (After cloning file1 to file2 with unaligned dest_offset) # cat /mnt/file2 cat: /mnt/file2: Input/output error Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 30 10月, 2010 9 次提交
-
-
由 Sage Weil 提交于
Add a mount option user_subvol_rm_allowed that allows users to delete a (potentially non-empty!) subvol when they would otherwise we allowed to do an rmdir(2). We duplicate the may_delete() checks from the core VFS code to implement identical security checks (minus the directory size check). We additionally require that the user has write+exec permission on the subvol root inode. Signed-off-by: NSage Weil <sage@newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
There is no reason to force an immediate commit when deleting a snapshot. Users have some expectation that space from a deleted snapshot be freed immediately, but even if we do commit the reclaim is a background process. If users _do_ want the deletion to be durable, they can call 'sync'. Signed-off-by: NSage Weil <sage@newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
Create a snap without waiting for it to commit to disk. The ioctl is ordered such that subsequent operations will not be contained by the created snapshot, and the commit is initiated, but the ioctl does not wait for the snapshot to commit to disk. We return the specific transid to userspace so that an application can wait for this specific snapshot creation to commit via the WAIT_SYNC ioctl. Signed-off-by: NSage Weil <sage@newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
START_SYNC will start a sync/commit, but not wait for it to complete. Any modification started after the ioctl returns is guaranteed not to be included in the commit. If a non-NULL pointer is passed, the transaction id will be returned to userspace. WAIT_SYNC will wait for any in-progress commit to complete. If a transaction id is specified, the ioctl will block and then return (success) when the specified transaction has committed. If it has already committed when we call the ioctl, it returns immediately. If the specified transaction doesn't exist, it returns EINVAL. If no transaction id is specified, WAIT_SYNC will wait for the currently committing transaction to finish it's commit to disk. If there is no currently committing transaction, it returns success. These ioctls are useful for applications which want to impose an ordering on when fs modifications reach disk, but do not want to wait for the full (slow) commit process to do so. Picky callers can take the transid returned by START_SYNC and feed it to WAIT_SYNC, and be certain to wait only as long as necessary for the transaction _they_ started to reach disk. Sloppy callers can START_SYNC and WAIT_SYNC without a transid, and provided they didn't wait too long between the calls, they will get the same result. However, if a second commit starts before they call WAIT_SYNC, they may end up waiting longer for it to commit as well. Even so, a START_SYNC+WAIT_SYNC still guarantees that any operation completed before the START_SYNC reaches disk. Signed-off-by: NSage Weil <sage@newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
I'm no lockdep expert, but this appears to make the lockdep warning go away for the i_mutex locking in the clone ioctl. Signed-off-by: NSage Weil <sage@newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
We had an edge case issue where the requested range was just following an existing extent. Instead of skipping to the next extent, we used the previous one which lead to having zero sized extents. Signed-off-by: NYehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Sage Weil 提交于
The lookup_first_ordered_extent() was done on the wrong inode, and the ->delalloc_bytes test was wrong, as the following btrfs_wait_ordered_range() would only invoke a range write and wouldn't write the entire file data range. Also, a bad parameter was passed to btrfs_wait_ordered_range(). Signed-off-by: NYehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Andi Kleen 提交于
These are all the cases where a variable is set, but not read which are not bugs as far as I can see, but simply leftovers. Still needs more review. Found by gcc 4.6's new warnings Signed-off-by: NAndi Kleen <ak@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Julia Lawall 提交于
Use memdup_user when user data is immediately copied into the allocated region. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to,size,flag; position p; identifier l1,l2; @@ - to = \(kmalloc@p\|kzalloc@p\)(size,flag); + to = memdup_user(from,size); if ( - to==NULL + IS_ERR(to) || ...) { <+... when != goto l1; - -ENOMEM + PTR_ERR(to) ...+> } - if (copy_from_user(to, from, size) != 0) { - <+... when != goto l2; - -EFAULT - ...+> - } // </smpl> Signed-off-by: NJulia Lawall <julia@diku.dk> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 23 10月, 2010 1 次提交
-
-
由 Josef Bacik 提交于
The new ENOSPC stuff broke the df ioctl since we no longer create seperate space info's for each RAID type. So instead, loop through each space info's raid lists so we can get the right RAID information which will allow the df ioctl to tell us RAID types again. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 20 7月, 2010 1 次提交
-
-
由 Dan Rosenberg 提交于
1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check whether the donor file is append-only before writing to it. 2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer overflow that allows a user to specify an out-of-bounds range to copy from the source file (if off + len wraps around). I haven't been able to successfully exploit this, but I'd imagine that a clever attacker could use this to read things he shouldn't. Even if it's not exploitable, it couldn't hurt to be safe. Signed-off-by: NDan Rosenberg <dan.j.rosenberg@gmail.com> cc: stable@kernel.org Signed-off-by: NChris Mason <chris.mason@oracle.com>
-