- 24 6月, 2016 10 次提交
-
-
由 Eric W. Biederman 提交于
Now that SB_I_NODEV controls the nodev behavior devpts can just clear this flag during mount. Simplifying the code and making it easier to audit how the code works. While still preserving the invariant that s_iflags is only modified during mount. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Replace the implict setting of MNT_NODEV on mounts that happen with just user namespace permissions with an implicit setting of SB_I_NODEV in s_iflags. The visibility of the implicit MNT_NODEV has caused problems in the past. With this change the fragile case where an implicit MNT_NODEV needs to be preserved in do_remount is removed. Using SB_I_NODEV is much less fragile as s_iflags are set during the original mount and never changed. In do_new_mount with the implicit setting of MNT_NODEV gone, the only code that can affect mnt_flags is fs_fully_visible so simplify the if statement and reduce the indentation of the code to make that clear. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Verify all filesystems that we check in mount_too_revealing set SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags. That is true for today and it should remain true in the future. Remove the now unnecessary checks from mnt_already_visibile that ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are preserved. Making the code shorter and easier to read. Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures the many current systems where proc and sysfs are mounted with "nosuid, nodev, noexec" and several slightly buggy container applications don't bother to set those flags continue to work. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Introduce a function may_open_dev that tests MNT_NODEV and a new superblock flab SB_I_NODEV. Use this new function in all of the places where MNT_NODEV was previously tested. Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those filesystems should never support device nodes, and a simple superblock flags makes that very hard to get wrong. With SB_I_NODEV set if any device nodes somehow manage to show up on on a filesystem those device nodes will be unopenable. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
The cgroup filesystem is in the same boat as sysfs. No one ever permits executables of any kind on the cgroup filesystem, and there is no reasonable future case to support executables in the future. Therefore move the setting of SB_I_NOEXEC which makes the code proof against future mistakes of accidentally creating executables from sysfs to kernfs itself. Making the code simpler and covering the sysfs, cgroup, and cgroup2 filesystems. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Allowing a filesystem to be mounted by other than root in the initial user namespace is a filesystem property not a mount namespace property and as such should be checked in filesystem specific code. Move the FS_USERNS_MOUNT test into super.c:sget_userns(). Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Start marking filesystems with a user namespace owner, s_user_ns. In this change this is only used for permission checks of who may mount a filesystem. Ultimately s_user_ns will be used for translating ids and checking capabilities for filesystems mounted from user namespaces. The default policy for setting s_user_ns is implemented in sget(), which arranges for s_user_ns to be set to current_user_ns() and to ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that user_ns. The guts of sget are split out into another function sget_userns(). The function sget_userns calls alloc_super with the specified user namespace or it verifies the existing superblock that was found has the expected user namespace, and fails with EBUSY when it is not. This failing prevents users with the wrong privileges mounting a filesystem. The reason for the split of sget_userns from sget is that in some cases such as mount_ns and kernfs_mount_ns a different policy for permission checking of mounts and setting s_user_ns is necessary, and the existence of sget_userns() allows those policies to be implemented. The helper mount_ns is expected to be used for filesystems such as proc and mqueuefs which present per namespace information. The function mount_ns is modified to call sget_userns instead of sget to ensure the user namespace owner of the namespace whose information is presented by the filesystem is used on the superblock. For sysfs and cgroup the appropriate permission checks are already in place, and kernfs_mount_ns is modified to call sget_userns so that the init_user_ns is the only user namespace used. For the cgroup filesystem cgroup namespace mounts are bind mounts of a subset of the full cgroup filesystem and as such s_user_ns must be the same for all of them as there is only a single superblock. Mounts of sysfs that vary based on the network namespace could in principle change s_user_ns but it keeps the analysis and implementation of kernfs simpler if that is not supported, and at present there appear to be no benefits from supporting a different s_user_ns on any sysfs mount. Getting the details of setting s_user_ns correct has been a long process. Thanks to Pavel Tikhorirorv who spotted a leak in sget_userns. Thanks to Seth Forshee who has kept the work alive. Thanks-to: Seth Forshee <seth.forshee@canonical.com> Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Move the call of get_pid_ns, the call of proc_parse_options, and the setting of s_iflags into proc_fill_super so that mount_ns can be used. Convert proc_mount to call mount_ns and remove the now unnecessary code. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Reviewed-by: NDjalal Harouni <tixxdz@gmail.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Today what is normally called data (the mount options) is not passed to fill_super through mount_ns. Pass the mount options and the namespace separately to mount_ns so that filesystems such as proc that have mount options, can use mount_ns. Pass the user namespace to mount_ns so that the standard permission check that verifies the mounter has permissions over the namespace can be performed in mount_ns instead of in each filesystems .mount method. Thus removing the duplication between mqueuefs and proc in terms of permission checks. The extra permission check does not currently affect the rpc_pipefs filesystem and the nfsd filesystem as those filesystems do not currently allow unprivileged mounts. Without unpvileged mounts it is guaranteed that the caller has already passed capable(CAP_SYS_ADMIN) which guarantees extra permission check will pass. Update rpc_pipefs and the nfsd filesystem to ensure that the network namespace reference is always taken in fill_super and always put in kill_sb so that the logic is simpler and so that errors originating inside of fill_super do not cause a network namespace leak. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Replace the call of fs_fully_visible in do_new_mount from before the new superblock is allocated with a call of mount_too_revealing after the superblock is allocated. This winds up being a much better location for maintainability of the code. The first change this enables is the replacement of FS_USERNS_VISIBLE with SB_I_USERNS_VISIBLE. Moving the flag from struct filesystem_type to sb_iflags on the superblock. Unfortunately mount_too_revealing fundamentally needs to touch mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate times. If the mnt_flags did not need to be touched the code could be easily moved into the filesystem specific mount code. Acked-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 15 6月, 2016 1 次提交
-
-
由 Eric W. Biederman 提交于
In rare cases it is possible for s_flags & MS_RDONLY to be set but MNT_READONLY to be clear. This starting combination can cause fs_fully_visible to fail to ensure that the new mount is readonly. Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY is set on the source filesystem of the mount. In general both MS_RDONLY and MNT_READONLY are set at the same for mounts so I don't expect any programs to care. Nor do I expect MS_RDONLY to be set on proc or sysfs in the initial user namespace, which further decreases the likelyhood of problems. Which means this change should only affect system configurations by paranoid sysadmins who should welcome the additional protection as it keeps people from wriggling out of their policies. Cc: stable@vger.kernel.org Fixes: 8c6cf9cc ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime") Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 07 6月, 2016 2 次提交
-
-
由 Eric W. Biederman 提交于
MNT_LOCKED implies on a child mount implies the child is locked to the parent. So while looping through the children the children should be tested (not their parent). Typically an unshare of a mount namespace locks all mounts together making both the parent and the slave as locked but there are a few corner cases where other things work. Cc: stable@vger.kernel.org Fixes: ceeb0e5d ("vfs: Ignore unlocked mounts in fs_fully_visible") Reported-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
由 Eric W. Biederman 提交于
Add this trivial missing error handling. Cc: stable@vger.kernel.org Fixes: 1b852bce ("mnt: Refactor the logic for mounting sysfs and proc in a user namespace") Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 06 6月, 2016 1 次提交
-
-
由 Eric W. Biederman 提交于
The /dev/ptmx device node is changed to lookup the directory entry "pts" in the same directory as the /dev/ptmx device node was opened in. If there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx uses that filesystem. Otherwise the open of /dev/ptmx fails. The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that userspace can now safely depend on each mount of devpts creating a new instance of the filesystem. Each mount of devpts is now a separate and equal filesystem. Reserved ttys are now available to all instances of devpts where the mounter is in the initial mount namespace. A new vfs helper path_pts is introduced that finds a directory entry named "pts" in the directory of the passed in path, and changes the passed in path to point to it. The helper path_pts uses a function path_parent_directory that was factored out of follow_dotdot. In the implementation of devpts: - devpts_mnt is killed as it is no longer meaningful if all mounts of devpts are equal. - pts_sb_from_inode is replaced by just inode->i_sb as all cached inodes in the tty layer are now from the devpts filesystem. - devpts_add_ref is rolled into the new function devpts_ptmx. And the unnecessary inode hold is removed. - devpts_del_ref is renamed devpts_release and reduced to just a deacrivate_super. - The newinstance mount option continues to be accepted but is now ignored. In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as they are never used. Documentation/filesystems/devices.txt is updated to describe the current situation. This has been verified to work properly on openwrt-15.05, centos5, centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3, ubuntu-15.10, fedora23, magia-5, mint-17.3, opensuse-42.1, slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. With the caveat that on centos6 and on slackware-14.1 that there wind up being two instances of the devpts filesystem mounted on /dev/pts, the lower copy does not end up getting used. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg KH <greg@kroah.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 6月, 2016 1 次提交
-
-
由 Chris Mason 提交于
When dealing with inline extents, btrfs_get_extent will incorrectly try to insert a duplicate extent_map. The dup hits -EEXIST from add_extent_map, but then we try to merge with the existing one and end up trying to insert a zero length extent_map. This actually works most of the time, except when there are extent maps past the end of the inline extent. rocksdb will trigger this sometimes because it preallocates an extent and then truncates down. Josef made a script to trigger with xfs_io: #!/bin/bash xfs_io -f -c "pwrite 0 1000" inline xfs_io -c "falloc -k 4k 1M" inline xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline xfs_io -c "fadvise -d 0 1000" inline cat inline You'll get EIOs trying to read inline after this because add_extent_map is returning EEXIST Signed-off-by: NChris Mason <clm@fb.com>
-
- 01 6月, 2016 7 次提交
-
-
由 Yan, Zheng 提交于
Signed-off-by: NYan, Zheng <zyan@redhat.com>
-
由 Yan, Zheng 提交于
There are several issues in fscache revalidation code. - In ceph_revalidate_work(), fscache_invalidate() is called when fscache_check_consistency() return 0. This is complete wrong because 0 means cache is valid. - Handle_cap_grant() calls ceph_queue_revalidate() if client already has CAP_FILE_CACHE. This code is confusing. Client should revalidate the cache each time it got CAP_FILE_CACHE anew. - In Handle_cap_grant(), fscache_invalidate() is called if MDS revokes CAP_FILE_CACHE. This is inconsistency with the case that inode get evicted. In the later case, the cache is not discarded. Client may use the cache when inode is reloaded. This patch moves the fscache revalidation into ceph_get_caps(). Client revalidates the cache after it gets CAP_FILE_CACHE. i_rdcache_gen should keep constance while CAP_FILE_CACHE is used. If i_fscache_gen is not equal to i_rdcache_gen, client needs to check cache's consistency. Signed-off-by: NYan, Zheng <zyan@redhat.com>
-
由 Yan, Zheng 提交于
All other filesystems do not add dirty pages to fscache. They all disable fscache when inode is opened for write. Only ceph adds dirty pages to fscache, but the code is buggy. Signed-off-by: NYan, Zheng <zyan@redhat.com>
-
由 Yan, Zheng 提交于
ceph_fill_file_size() has already called ceph_fscache_invalidate() if it return true. Signed-off-by: NYan, Zheng <zyan@redhat.com>
-
由 Yan, Zheng 提交于
If readpages fails, fscache needs to cleanup its internal state. Signed-off-by: NYan, Zheng <zyan@redhat.com>
-
由 Yan, Zheng 提交于
__fscache_check_consistency() calls check_consistency() callback and return the callback's return value. But the return type of check_consistency() is bool. So __fscache_check_consistency() return 1 if the cache is inconsistent. This is inconsistent with the document. Signed-off-by: NYan, Zheng <zyan@redhat.com> Acked-by: NDavid Howells <dhowells@redhat.com>
-
由 Yan, Zheng 提交于
Signed-off-by: NYan, Zheng <zyan@redhat.com> Acked-by: NDavid Howells <dhowells@redhat.com>
-
- 31 5月, 2016 3 次提交
-
-
由 Filipe Manana 提交于
While we are finishing a device replace operation we can have a concurrent task trying to do a read repair operation, in which case it will call btrfs_map_block() to get a struct btrfs_bio which can have a stripe that points to the source device of the device replace operation. This allows for the read repair task to dereference the stripe's device pointer after the device replace operation has freed the source device, resulting in an invalid memory access. This is similar to the problem solved by my previous patch in the same series and named "Btrfs: fix race between device replace and discard". So fix this by surrounding the call to btrfs_map_block() and the code that uses the returned struct btrfs_bio with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the proper serialization with the finishing phase of the device replace operation. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Filipe Manana 提交于
While we are finishing a device replace operation, we can make a discard operation (fs mounted with -o discard) do an invalid memory access like the one reported by the following trace: [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs] [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1 [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000 [ 3206.388595] RIP: 0010:[<ffffffff8124d233>] [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7 [ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246 [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000 [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48 [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001 [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8 [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b [ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000 [ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0 [ 3206.388595] Stack: [ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000 [ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0 [ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0 [ 3206.388595] Call Trace: [ 3206.388595] [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs] [ 3206.388595] [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs] [ 3206.388595] [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs] [ 3206.388595] [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs] [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b [ 3206.388595] [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs] [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b [ 3206.388595] [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs] [ 3206.388595] [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e [ 3206.388595] [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e [ 3206.388595] [<ffffffff811a8417>] do_fsync+0x31/0x4a [ 3206.388595] [<ffffffff811a8637>] SyS_fsync+0x10/0x14 [ 3206.388595] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 3206.388595] [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14 [ 3206.388595] [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa This happens because when we call btrfs_map_block() from btrfs_discard_extent() to get a btrfs_bio structure, the device replace operation has not finished yet, but before we use the device of one of the stripes from the returned btrfs_bio structure, the device object is freed. This is illustrated by the following diagram. CPU 1 CPU 2 btrfs_dev_replace_start() (...) btrfs_dev_replace_finishing() btrfs_start_transaction() btrfs_commit_transaction() (...) btrfs_sync_file() btrfs_start_transaction() (...) btrfs_commit_transaction() btrfs_finish_extent_commit() btrfs_discard_extent() btrfs_map_block() --> returns a struct btrfs_bio with a stripe that has a device field pointing to source device of the replace operation (the device that is being replaced) mutex_lock(&uuid_mutex) mutex_lock(&fs_info->fs_devices->device_list_mutex) mutex_lock(&fs_info->chunk_mutex) btrfs_dev_replace_update_device_in_mapping_tree() --> iterates the mapping tree and for each extent map that has a stripe pointing to the source device, it updates the stripe to point to the target device instead btrfs_rm_dev_replace_blocked() --> waits for fs_info->bio_counter to go down to 0 btrfs_rm_dev_replace_remove_srcdev() --> removes source device from the list of devices mutex_unlock(&fs_info->chunk_mutex) mutex_unlock(&fs_info->fs_devices->device_list_mutex) mutex_unlock(&uuid_mutex) btrfs_rm_dev_replace_free_srcdev() --> frees the source device --> iterates over all stripes of the returned struct btrfs_bio --> for each stripe it dereferences its device pointer --> it ends up finding a pointer to the device used as the source device for the replace operation and that was already freed So fix this by surrounding the call to btrfs_map_block(), and the code that uses the returned struct btrfs_bio, with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that the finishing phase of the device replace operation blocks until the the bio counter decreases to zero before it frees the source device. This is the same approach we do at btrfs_map_bio() for example. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Ilya Dryomov 提交于
For the benefit of every single caller, take osdc instead of map. Also, now that osdc->osdmap can't ever be NULL, drop the check. Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
- 30 5月, 2016 6 次提交
-
-
由 Filipe Manana 提交于
While iterating and copying extents from the source device, the device replace code keeps adjusting a left cursor that is used to make sure that once we finish processing a device extent, any future writes to extents from the corresponding block group will get into both the source and target devices. This left cursor is also used for resuming the device replace operation at mount time. However using this left cursor to decide whether writes go into both devices or only the source device is not enough to guarantee we don't miss copying extents into the target device. There are two cases where the current approach fails. The first one is related to when there are holes in the device and they get allocated for new block groups while the device replace operation is iterating the device extents (more on this explained below). The second one is that when that loop over the device extents finishes, we start dellaloc, wait for all ordered extents and then commit the current transaction, we might have got new block groups allocated that are now using a device extent that has an offset greater then or equals to the value of the left cursor, in which case writes to extents belonging to these new block groups will get issued only to the source device. For the first case where the current approach of using a left cursor fails, consider the source device currently has the following layout: [ extent bg A ] [ hole, unallocated space ] [extent bg B ] 3Gb 4Gb 5Gb While we are iterating the device extents from the source device using the commit root of the device tree, the following happens: CPU 1 CPU 2 <we are at transaction N> scrub_enumerate_chunks() --> searches the device tree for extents belonging to the source device using the device tree's commit root --> 1st iteration finds extent belonging to block group A --> sets block group A to RO mode (btrfs_inc_block_group_ro) --> sets cursor left to found_key.offset which is 3Gb --> scrub_chunk() starts copies all allocated extents from block group's A stripe at source device into target device btrfs_alloc_chunk() --> allocates device extent in the range [4Gb, 5Gb[ from the source device for a new block group C extent allocated from block group C for a direct IO, buffered write or btree node/leaf extent is written to, perhaps in response to a writepages() call from the VM or directly through direct IO the write is made only against the source device and not against the target device because the extent's offset is in the interval [4Gb, 5Gb[ which is larger then the value of cursor_left (3Gb) --> scrub_chunks() finishes --> updates left cursor from 3Gb to 4Gb --> btrfs_dec_block_group_ro() sets block group A back to RW mode <we are still at transaction N> --> 2nd iteration finds extent belonging to block group B - it did not find the new extent in the range [4Gb, 5Gb[ for block group C because we are using the device tree's commit root or even because the block group's items are not all yet inserted in the respective btrees, that is, the block group is still attached to some transaction handle's new_bgs list and btrfs_create_pending_block_groups() was not called yet against that transaction handle, so the device extent items were not yet inserted into the devices tree <we are still at transaction N> --> so we end not copying anything from the newly allocated device extent from the source device to the target device So fix this by making __btrfs_map_block() always redirect writes to the target device as well, independently of the left cursor's value. With this change the left cursor is now used only for the purpose of tracking progress and allow a mount operation to resume a device replace. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Filipe Manana 提交于
After it finishes processing a device extent, the device replace code sets back the block group to RW mode and then after that it sets the left cursor to match the logical end address of the block group, so that future writes into extents belonging to the block group go both the source (old) and target (new) devices. However from the moment we turn the block group back to RW mode we have a short time window, that lasts until we update the left cursor's value, where extents can be allocated from the block group and written to, in which case they will not be copied/written to the target (new) device. Fix this by updating the left cursor's value before turning the block group back to RW mode. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Filipe Manana 提交于
We were assigning new values to fields of the device replace object without holding the respective lock after processing each device extent. This is important for the left cursor field which can be accessed by a concurrent task running __btrfs_map_block (which, correctly, takes the device replace lock). So change these fields while holding the device replace lock. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Filipe Manana 提交于
When we do a device replace, for each device extent we find from the source device, we set the corresponding block group to readonly mode to prevent writes into it from happening while we are copying the device extent from the source to the target device. However just before we set the block group to readonly mode some concurrent task might have already allocated an extent from it or decided it could perform a nocow write into one of its extents, which can make the device replace process to miss copying an extent since it uses the extent tree's commit root to search for extents and only once it finishes searching for all extents belonging to the block group it does set the left cursor to the logical end address of the block group - this is a problem if the respective ordered extents finish while we are searching for extents using the extent tree's commit root and no transaction commit happens while we are iterating the tree, since it's the delayed references created by the ordered extents (when they complete) that insert the extent items into the extent tree (using the non-commit root of course). Example: CPU 1 CPU 2 btrfs_dev_replace_start() btrfs_scrub_dev() scrub_enumerate_chunks() --> finds device extent belonging to block group X <transaction N starts> starts buffered write against some inode writepages is run against that inode forcing dellaloc to run btrfs_writepages() extent_writepages() extent_write_cache_pages() __extent_writepage() writepage_delalloc() run_delalloc_range() cow_file_range() btrfs_reserve_extent() --> allocates an extent from block group X (which is not yet in RO mode) btrfs_add_ordered_extent() --> creates ordered extent Y flush_epd_write_bio() --> bio against the extent from block group X is submitted btrfs_inc_block_group_ro(bg X) --> sets block group X to readonly scrub_chunk(bg X) scrub_stripe(device extent from srcdev) --> keeps searching for extent items belonging to the block group using the extent tree's commit root --> it never blocks due to fs_info->scrub_pause_req as no one tries to commit transaction N --> copies all extents found from the source device into the target device --> finishes search loop bio completes ordered extent Y completes and creates delayed data reference which will add an extent item to the extent tree when run (typically at transaction commit time) --> so the task doing the scrub/device replace at CPU 1 misses this and does not copy this extent into the new/target device btrfs_dec_block_group_ro(bg X) --> turns block group X back to RW mode dev_replace->cursor_left is set to the logical end offset of block group X So fix this by waiting for all cow and nocow writes after setting a block group to readonly mode. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Filipe Manana 提交于
When it's finishing, the device replace code iterates all extent maps representing block group and for each one that has a stripe that refers to the source device, it replaces its device with the target device. However when it replaces the source device with the target device it, the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID), only after its ID is changed to match the one from the source device. This leads to races with the chunk removal code that can temporarly see a device with an ID of 0ULL and then attempt to use that ID to remove items from the device tree and fail, causing a transaction abort: [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished [ 9238.594377] ------------[ cut here ]------------ [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594403] BTRFS: Transaction aborted (error 1) [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl oppy [last unloaded: btrfs] [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1 [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 9238.594421] 0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0 [ 9238.594422] 0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18 [ 9238.594423] 0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0 [ 9238.594424] Call Trace: [ 9238.594428] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [ 9238.594430] [<ffffffff81052b14>] __warn+0xc2/0xdd [ 9238.594432] [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53 [ 9238.594434] [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188 [ 9238.594450] [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs] [ 9238.594452] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [ 9238.594464] [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs] [ 9238.594476] [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs] [ 9238.594489] [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs] [ 9238.594490] [<ffffffff8106f403>] kthread+0xd4/0xdc [ 9238.594494] [<ffffffff8149e242>] ret_from_fork+0x22/0x40 [ 9238.594495] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286 [ 9238.594496] ---[ end trace 183efbe50275f059 ]--- The sequence of steps leading to this is like the following: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_delete_unused_bgs() btrfs_remove_chunk() looks up for the extent map corresponding to the chunk lock_chunks() (chunk_mutex) check_system_chunk() unlock_chunks() (chunk_mutex) locks fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) iterates over all stripes from the extent map --> calls btrfs_free_dev_extent() passing it the target device that still has an ID of 0ULL --> btrfs_free_dev_extent() fails --> aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid (which is necessarily > 0) frees the srcdev unlocks fs_info->chunk_mutex So fix this by taking the device list mutex while processing the stripes for the chunk's extent map. This is similar to the race between device replace and block group creation that was fixed by commit 50460e37 ("Btrfs: fix race when finishing dev replace leading to transaction abort"). Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Filipe Manana 提交于
The list of devices is protected by the device_list_mutex and the device replace code, in its finishing phase correctly takes that mutex before removing the source device from that list. However the readahead code was iterating that list without acquiring the respective mutex leading to crashes later on due to invalid memory accesses: [125671.831036] general protection fault: 0000 [#1] PREEMPT SMP [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport processor ser [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G W 4.6.0-rc7-btrfs-next-29+ #1 [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs] [125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000 [125671.834973] RIP: 0010:[<ffffffff81270479>] [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105 [125671.834973] RSP: 0018:ffff8801ac91bc28 EFLAGS: 00010206 [125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000 [125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8 [125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000 [125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8 [125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff [125671.834973] FS: 0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000 [125671.834973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0 [125671.834973] Stack: [125671.834973] 0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000 [125671.834973] 0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000 [125671.834973] ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0 [125671.834973] Call Trace: [125671.834973] [<ffffffff81270541>] radix_tree_lookup+0xd/0xf [125671.834973] [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs] [125671.834973] [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs] [125671.834973] [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs] [125671.834973] [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs] [125671.834973] [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs] [125671.834973] [<ffffffff81069691>] process_one_work+0x271/0x4e9 [125671.834973] [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9 [125671.834973] [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3 [125671.834973] [<ffffffff8106f403>] kthread+0xd4/0xdc [125671.834973] [<ffffffff8149e242>] ret_from_fork+0x22/0x40 [125671.834973] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286 So fix this by taking the device_list_mutex in the readahead code. We can't use here the lighter approach of using a rcu_read_lock() and rcu_read_unlock() pair together with a list_for_each_entry_rcu() call because we end up doing calls to sleeping functions (kzalloc()) in the respective code path. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
- 29 5月, 2016 9 次提交
-
-
由 George Spelvin 提交于
The self-test was updated to cover zero-length strings; the function needs to be updated, too. Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net> Fixes: fcfd2fbf ("fs/namei.c: Add hashlen_string() function") Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 George Spelvin 提交于
The original name was simply hash_string(), but that conflicted with a function with that name in drivers/base/power/trace.c, and I decided that calling it "hashlen_" was better anyway. But you have to do it in two places. [ This caused build errors for architectures that don't define CONFIG_DCACHE_WORD_ACCESS - Linus ] Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net> Reported-by: NGuenter Roeck <linux@roeck-us.net> Fixes: fcfd2fbf ("fs/namei.c: Add hashlen_string() function") Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mikulas Patocka 提交于
The HPFS filesystem used generic_show_options to produce string that is displayed in /proc/mounts. However, there is a problem that the options may disappear after remount. If we mount the filesystem with option1 and then remount it with option2, /proc/mounts should show both option1 and option2, however it only shows option2 because the whole option string is replaced with replace_mount_options in hpfs_remount_fs. To fix this bug, implement the hpfs_show_options function that prints options that are currently selected. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mikulas Patocka 提交于
Commit c8f33d0b ("affs: kstrdup() memory handling") checks if the kstrdup function returns NULL due to out-of-memory condition. However, if we are remounting a filesystem with no change to filesystem-specific options, the parameter data is NULL. In this case, kstrdup returns NULL (because it was passed NULL parameter), although no out of memory condition exists. The mount syscall then fails with ENOMEM. This patch fixes the bug. We fail with ENOMEM only if data is non-NULL. The patch also changes the call to replace_mount_options - if we didn't pass any filesystem-specific options, we don't call replace_mount_options (thus we don't erase existing reported options). Fixes: c8f33d0b ("affs: kstrdup() memory handling") Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mikulas Patocka 提交于
Commit ce657611 ("hpfs: kstrdup() out of memory handling") checks if the kstrdup function returns NULL due to out-of-memory condition. However, if we are remounting a filesystem with no change to filesystem-specific options, the parameter data is NULL. In this case, kstrdup returns NULL (because it was passed NULL parameter), although no out of memory condition exists. The mount syscall then fails with ENOMEM. This patch fixes the bug. We fail with ENOMEM only if data is non-NULL. The patch also changes the call to replace_mount_options - if we didn't pass any filesystem-specific options, we don't call replace_mount_options (thus we don't erase existing reported options). Fixes: ce657611 ("hpfs: kstrdup() out of memory handling") Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Guenter Roeck 提交于
Various builds (such as i386:allmodconfig) fail with fs/binfmt_aout.c:133:2: error: expected identifier or '(' before 'return' fs/binfmt_aout.c:134:1: error: expected identifier or '(' before '}' token [ Oops. My bad, I had stupidly thought that "allmodconfig" covered this on x86-64 too, but it obviously doesn't. Egg on my face. - Linus ] Fixes: 5d22fc25 ("mm: remove more IS_ERR_VALUE abuses") Signed-off-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 George Spelvin 提交于
This is just the infrastructure; there are no users yet. This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares the existence of <asm/hash.h>. That file may define its own versions of various functions, and define HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones. Included is a self-test (in lib/test_hash.c) that verifies the basics. It is NOT in general required that the arch-specific functions compute the same thing as the generic, but if a HAVE_* symbol is defined with the value 1, then equality is tested. Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Philippe De Muyter <phdm@macq.eu> Cc: linux-m68k@lists.linux-m68k.org Cc: Alistair Francis <alistai@xilinx.com> Cc: Michal Simek <michal.simek@xilinx.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: uclinux-h8-devel@lists.sourceforge.jp
-
由 George Spelvin 提交于
Patch 0fed3ac8 improved the hash mixing, but the function is slower than necessary; there's a 7-instruction dependency chain (10 on x86) each loop iteration. Word-at-a-time access is a very tight loop (which is good, because link_path_walk() is one of the hottest code paths in the entire kernel), and the hash mixing function must not have a longer latency to avoid slowing it down. There do not appear to be any published fast hash functions that: 1) Operate on the input a word at a time, and 2) Don't need to know the length of the input beforehand, and 3) Have a single iterated mixing function, not needing conditional branches or unrolling to distinguish different loop iterations. One of the algorithms which comes closest is Yann Collet's xxHash, but that's two dependent multiplies per word, which is too much. The key insights in this design are: 1) Barring expensive ops like multiplies, to diffuse one input bit across 64 bits of hash state takes at least log2(64) = 6 sequentially dependent instructions. That is more cycles than we'd like. 2) An operation like "hash ^= hash << 13" requires a second temporary register anyway, and on a 2-operand machine like x86, it's three instructions. 3) A better use of a second register is to hold a two-word hash state. With careful design, no temporaries are needed at all, so it doesn't increase register pressure. And this gets rid of register copying on 2-operand machines, so the code is smaller and faster. 4) Using two words of state weakens the requirement for one-round mixing; we now have two rounds of mixing before cancellation is possible. 5) A two-word hash state also allows operations on both halves to be done in parallel, so on a superscalar processor we get more mixing in fewer cycles. I ended up using a mixing function inspired by the ChaCha and Speck round functions. It is 6 simple instructions and 3 cycles per iteration (assuming multiply by 9 can be done by an "lea" instruction): x ^= *input++; y ^= x; x = ROL(x, K1); x += y; y = ROL(y, K2); y *= 9; Not only is this reversible, two consecutive rounds are reversible: if you are given the initial and final states, but not the intermediate state, it is possible to compute both input words. This means that at least 3 words of input are required to create a collision. (It also has the property, used by hash_name() to avoid a branch, that it hashes all-zero to all-zero.) The rotate constants K1 and K2 were found by experiment. The search took a sample of random initial states (I used 1023) and considered the effect of flipping each of the 64 input bits on each of the 128 output bits two rounds later. Each of the 8192 pairs can be considered a biased coin, and adding up the Shannon entropy of all of them produces a score. The best-scoring shifts also did well in other tests (flipping bits in y, trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits), so the choice was made with the additional constraint that the sum of the shifts is odd and not too close to the word size. The final state is then folded into a 32-bit hash value by a less carefully optimized multiply-based scheme. This also has to be fast, as pathname components tend to be short (the most common case is one iteration!), but there's some room for latency, as there is a fair bit of intervening logic before the hash value is used for anything. (Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need a better benchmark; the numbers seem to show a slight dip in performance between 4.6.0 and this patch, but they're too noisy to quote.) Special thanks to Bruce fields for diligent testing which uncovered a nasty fencepost error in an earlier version of this patch. [checkpatch.pl formatting complaints noted and respectfully disagreed with.] Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net> Tested-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 George Spelvin 提交于
We'd like to make more use of the highly-optimized dcache hash functions throughout the kernel, rather than have every subsystem create its own, and a function that hashes basic null-terminated strings is required for that. (The name is to emphasize that it returns both hash and length.) It's actually useful in the dcache itself, specifically d_alloc_name(). Other uses in the next patch. full_name_hash() is also tweaked to make it more generally useful: 1) Take a "char *" rather than "unsigned char *" argument, to be consistent with hash_name(). 2) Handle zero-length inputs. If we want more callers, we don't want to make them worry about corner cases. Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net>
-