- 23 2月, 2016 9 次提交
-
-
由 Jaegeuk Kim 提交于
The sceanrio is: 1. create fully node blocks 2. flush node blocks 3. write inline_data for all the node blocks again 4. flush node blocks redundantly So, this patch tries to flush inline_data when flushing node blocks. Reviewed-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
We should consider data block allocation to trigger f2fs_balance_fs. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
The scenario is: 1. create lots of node blocks 2. sync 3. write lots of inline_data -> got panic due to no free space In that case, we should flush node blocks when writing inline_data in #3, and trigger gc as well. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
If there are many writepages calls by multiple threads in background, we don't need to serialize to merge all the bios, since it's background. In such the case, it'd better to run writepages concurrently. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This patch removes needless condition variable. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
get_new_segment starts from current segment position, tries to search a free segment among its right neighbors locate in same section. But previously our search area was set as [current segment, max segment], which means we have to search to more bits in free_segmap bitmap for some worse cases. So here we correct the search area to [current segment, last segment in section] to avoid unnecessary searching. Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
This patch exports a new sysfs entry 'dirty_nat_ratio' to control threshold of dirty nat entries, if current ratio exceeds configured threshold, checkpoint will be triggered in f2fs_balance_fs_bg for flushing dirty nats. Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
When testing f2fs with xfstest, generic/251 is stuck for long time, the case uses below serials to obtain fresh released space in device, in order to prepare for following fstrim test. 1. rm -rf /mnt/dir 2. mkdir /mnt/dir/ 3. cp -axT `pwd`/ /mnt/dir/ 4. goto 1 During preparing step, all nat entries will be cached in nat cache, most of them are dirty entries with invalid blkaddr, which means nodes related to these entries have been truncated, and they could be reused after the dirty entries been checkpointed. However, there was no checkpoint been triggered, so nid allocators (e.g. mkdir, creat) will run into long journey of iterating all NAT pages, looking for free nids in alloc_nid->build_free_nids. Here, in f2fs_balance_fs_bg we give another chance to do checkpoint to flush nat entries for reusing them in free nid cache when dirty entry count exceeds 10% of max count. Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
Operations in is_merged_page is related to inner bio cache, move it to data.c. Signed-off-by: NChao Yu <chao2.yu@samsung.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 19 2月, 2016 4 次提交
-
-
由 Jan Kara 提交于
Competing overwrite DIO in dioread_nolock mode will just overwrite pointer to io_end in the inode. This may result in data corruption or extent conversion happening from IO completion interrupt because we don't properly set buffer_defer_completion() when unlocked DIO races with locked DIO to unwritten extent. Since unlocked DIO doesn't need io_end for anything, just avoid allocating it and corrupting pointer from inode for locked DIO. A cleaner fix would be to avoid these games with io_end pointer from the inode but that requires more intrusive changes so we leave that for later. Cc: stable@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Jan Kara 提交于
ext4 can update bh->b_state non-atomically in _ext4_get_block() and ext4_da_get_block_prep(). Usually this is fine since bh is just a temporary storage for mapping information on stack but in some cases it can be fully living bh attached to a page. In such case non-atomic update of bh->b_state can race with an atomic update which then gets lost. Usually when we are mapping bh and thus updating bh->b_state non-atomically, nobody else touches the bh and so things work out fine but there is one case to especially worry about: ext4_finish_bio() uses BH_Uptodate_Lock on the first bh in the page to synchronize handling of PageWriteback state. So when blocksize < pagesize, we can be atomically modifying bh->b_state of a buffer that actually isn't under IO and thus can race e.g. with delalloc trying to map that buffer. The result is that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock. Fix the problem by always updating bh->b_state bits atomically. CC: stable@vger.kernel.org Reported-by: NNikolay Borisov <kernel@kyup.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Jeff Layton 提交于
We don't require a dedicated thread for fsnotify cleanup. Switch it over to a workqueue job instead that runs on the system_unbound_wq. In the interest of not thrashing the queued job too often when there are a lot of marks being removed, we delay the reaper job slightly when queueing it, to allow several to gather on the list. Signed-off-by: NJeff Layton <jeff.layton@primarydata.com> Tested-by: NEryu Guan <guaneryu@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jeff Layton 提交于
This reverts commit c510eff6 ("fsnotify: destroy marks with call_srcu instead of dedicated thread"). Eryu reported that he was seeing some OOM kills kick in when running a testcase that adds and removes inotify marks on a file in a tight loop. The above commit changed the code to use call_srcu to clean up the marks. While that does (in principle) work, the srcu callback job is limited to cleaning up entries in small batches and only once per jiffy. It's easily possible to overwhelm that machinery with too many call_srcu callbacks, and Eryu's reproduer did just that. There's also another potential problem with using call_srcu here. While you can obviously sleep while holding the srcu_read_lock, the callbacks run under local_bh_disable, so you can't sleep there. It's possible when putting the last reference to the fsnotify_mark that we'll end up putting a chain of references including the fsnotify_group, uid, and associated keys. While I don't see any obvious ways that that could occurs, it's probably still best to avoid using call_srcu here after all. This patch reverts the above patch. A later patch will take a different approach to eliminated the dedicated thread here. Signed-off-by: NJeff Layton <jeff.layton@primarydata.com> Reported-by: NEryu Guan <guaneryu@gmail.com> Tested-by: NEryu Guan <guaneryu@gmail.com> Cc: Jan Kara <jack@suse.com> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 2月, 2016 2 次提交
-
-
由 Tahsin Erdogan 提交于
inode struct members that track cgroup writeback information should be reinitialized when inode gets allocated from kmem_cache. Otherwise, their values remain and get used by the new inode. Signed-off-by: NTahsin Erdogan <tahsin@google.com> Acked-by: NTejun Heo <tj@kernel.org> Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching") Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
If cgroup writeback is in use, an inode is associated with a cgroup for writeback. If the inode's main dirtier changes to another cgroup, the association gets updated asynchronously. Nothing was pinning the superblock while such switches are in progress and superblock could go away while async switching is pending or in progress leading to crashes like the following. kernel BUG at fs/jbd2/transaction.c:319! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51 Hardware name: Google Google, BIOS Google 01/01/2011 Workqueue: events inode_switch_wbs_work_fn task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000 RIP: 0010:[<ffffffff803e6922>] [<ffffffff803e6922>] start_this_handle+0x382/0x3e0 RSP: 0018:ffff880209267c30 EFLAGS: 00010202 ... Call Trace: [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190 [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70 [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0 [<ffffffff8035338b>] evict+0xbb/0x190 [<ffffffff80354190>] iput+0x130/0x190 [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0 [<ffffffff80279819>] process_one_work+0x129/0x300 [<ffffffff80279b16>] worker_thread+0x126/0x480 [<ffffffff8027ed14>] kthread+0xc4/0xe0 [<ffffffff809771df>] ret_from_fork+0x3f/0x70 Fix it by bumping s_active while cgroup association switching is in flight. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-and-tested-by: NTahsin Erdogan <tahsin@google.com> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching") Cc: stable@vger.kernel.org #v4.5+ Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 16 2月, 2016 2 次提交
-
-
由 Kirill Tkhai 提交于
When ext4_bread() fails, fname_crypto_str remains allocated after return. Fix that. Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
-
由 Filipe Manana 提交于
If a bio for a direct IO request fails, we were not setting the error in the parent bio (the main DIO bio), making us not return the error to user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO() return the number of bytes issued for IO and not the error a bio created and submitted by btrfs_submit_direct() got from the block layer. This essentially happens because when we call: dio_end_io(dio_bio, bio->bi_error); It does not set dio_bio->bi_error to the value of the second argument. So just add this missing assignment in endio callbacks, just as we do in the error path at btrfs_submit_direct() when we fail to clone the dio bio or allocate its private object. This follows the convention of what is done with other similar APIs such as bio_endio() where the caller is responsible for setting the bi_error field in the bio it passes as an argument to bio_endio(). This was detected by the new generic test cases in xfstests: 271, 272, 276 and 278. Which essentially setup a dm error target, then load the error table, do a direct IO write and unload the error table. They expect the write to fail with -EIO, which was not getting reported when testing against btrfs. Cc: stable@vger.kernel.org # 4.3+ Fixes: 4246a0b6 ("block: add a bi_error field to struct bio") Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
- 12 2月, 2016 6 次提交
-
-
由 Eryu Guan 提交于
The "newblock" parameter is not used in convert_initialized_extent(), remove it. Signed-off-by: NEryu Guan <guaneryu@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Eryu Guan 提交于
I notice ext4/307 fails occasionally on ppc64 host, reporting md5 checksum mismatch after moving data from original file to donor file. The reason is that move_extent_per_page() calls __block_write_begin() and block_commit_write() to write saved data from original inode blocks to donor inode blocks, but __block_write_begin() not only maps buffer heads but also reads block content from disk if the size is not block size aligned. At this time the physical block number in mapped buffer head is pointing to the donor file not the original file, and that results in reading wrong data to page, which get written to disk in following block_commit_write call. This also can be reproduced by the following script on 1k block size ext4 on x86_64 host: mnt=/mnt/ext4 donorfile=$mnt/donor testfile=$mnt/testfile e4compact=~/xfstests/src/e4compact rm -f $donorfile $testfile # reserve space for donor file, written by 0xaa and sync to disk to # avoid EBUSY on EXT4_IOC_MOVE_EXT xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile # create test file written by 0xbb xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile # compute initial md5sum md5sum $testfile | tee md5sum.txt # drop cache, force e4compact to read data from disk echo 3 > /proc/sys/vm/drop_caches # test defrag echo "$testfile" | $e4compact -i -v -f $donorfile # check md5sum md5sum -c md5sum.txt Fix it by creating & mapping buffer heads only but not reading blocks from disk, because all the data in page is guaranteed to be up-to-date in mext_page_mkuptodate(). Cc: stable@vger.kernel.org Signed-off-by: NEryu Guan <guaneryu@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Insu Yun 提交于
Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data), integer overflow could be happened. Therefore, need to fix integer overflow sanitization. Cc: stable@vger.kernel.org Signed-off-by: NInsu Yun <wuninsu@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Huaitong Han 提交于
This patch adds a line break for proc mb_groups display. Signed-off-by: NHuaitong Han <huaitong.han@intel.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
-
由 Anton Protopopov 提交于
The ext4_ioctl_setflags() function which is used in the ioctls EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value EPERM instead of -EPERM in case of error. This bug was introduced by a recent commit 9b7365fc. The following program can be used to illustrate the wrong behavior: #include <sys/types.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <fcntl.h> #include <err.h> #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) #define FS_IMMUTABLE_FL 0x00000010 int main(void) { int fd; long flags; fd = open("file", O_RDWR|O_CREAT, 0600); if (fd < 0) err(1, "open"); if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) err(1, "ioctl: FS_IOC_GETFLAGS"); flags |= FS_IMMUTABLE_FL; if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) err(1, "ioctl: FS_IOC_SETFLAGS"); warnx("ioctl returned no error"); return 0; } Running it gives the following result: $ strace -e ioctl ./test ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0 ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1 test: ioctl returned no error +++ exited with 0 +++ Running the program on a kernel with the bug fixed gives the proper result: $ strace -e ioctl ./test ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0 ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted) test: ioctl: FS_IOC_SETFLAGS: Operation not permitted +++ exited with 1 +++ Signed-off-by: NAnton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Jan Kara 提交于
When block group checksum is wrong, we call ext4_error() while holding group spinlock from ext4_init_block_bitmap() or ext4_init_inode_bitmap() which results in scheduling while in atomic. Fix the issue by calling ext4_error() later after dropping the spinlock. CC: stable@vger.kernel.org Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 11 2月, 2016 5 次提交
-
-
由 David Sterba 提交于
The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284Reported-by: NVictor <services@swwu.com> CC: stable@vger.kernel.org Signed-off-by: NDavid Sterba <dsterba@suse.com> Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anton Protopopov 提交于
The setup_ntlmv2_rsp() function may return positive value ENOMEM instead of -ENOMEM in case of kmalloc failure. Signed-off-by: NAnton Protopopov <a.s.protopopov@gmail.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: NSteve French <smfrench@gmail.com>
-
由 Insu Yun 提交于
In worst case, "ip=" + sb_mountdata + ipv6 can be copied into mountdata. Therefore, for safe, it is better to add more size when allocating memory. Signed-off-by: NInsu Yun <wuninsu@gmail.com> Signed-off-by: NSteve French <smfrench@gmail.com>
-
由 Colin Ian King 提交于
server_RFC1001_name is declared as a RFC1001_NAME_LEN_WITH_NULL sized char array in struct TCP_Server_Info so the null pointer check on server_RFC1001_name is redundant and can be removed. Detected with smatch: fs/cifs/connect.c:2982 ip_rfc1001_connect() warn: this array is probably non-NULL. 'server->server_RFC1001_name' Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NSteve French <smfrench@gmail.com>
-
由 Peter Jones 提交于
"rm -rf" is bricking some peoples' laptops because of variables being used to store non-reinitializable firmware driver data that's required to POST the hardware. These are 100% bugs, and they need to be fixed, but in the mean time it shouldn't be easy to *accidentally* brick machines. We have to have delete working, and picking which variables do and don't work for deletion is quite intractable, so instead make everything immutable by default (except for a whitelist), and make tools that aren't quite so broad-spectrum unset the immutable flag. Signed-off-by: NPeter Jones <pjones@redhat.com> Tested-by: NLee, Chun-Yi <jlee@suse.com> Acked-by: NMatthew Garrett <mjg59@coreos.com> Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
-
- 10 2月, 2016 1 次提交
-
-
由 Peter Jones 提交于
Translate EFI's UCS-2 variable names to UTF-8 instead of just assuming all variable names fit in ASCII. Signed-off-by: NPeter Jones <pjones@redhat.com> Acked-by: NMatthew Garrett <mjg59@coreos.com> Tested-by: NLee, Chun-Yi <jlee@suse.com> Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
-
- 08 2月, 2016 3 次提交
-
-
由 Theodore Ts'o 提交于
In the case where the per-file key for the directory is cached, but root does not have access to the key needed to derive the per-file key for the files in the directory, we allow the lookup to succeed, so that lstat(2) and unlink(2) can suceed. However, if a program tries to open the file, it will get an ENOKEY error. Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Theodore Ts'o 提交于
Add a validation check for dentries for encrypted directory to make sure we're not caching stale data after a key has been added or removed. Also check to make sure that status of the encryption key is updated when readdir(2) is executed. Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Darrick J. Wong 提交于
Since the checksum function and the field are both __le32, don't perform endian conversion when comparing the two. This fixes mount failures on ppc64. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NBrian Foster <bfoster@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com>
-
- 07 2月, 2016 1 次提交
-
-
由 Herton R. Krzesinski 提交于
Considering current pty code and multiple devpts instances, it's possible to umount a devpts file system while a program still has /dev/tty opened pointing to a previosuly closed pty pair in that instance. In the case all ptmx and pts/N files are closed, umount can be done. If the program closes /dev/tty after umount is done, devpts_kill_index will use now an invalid super_block, which was already destroyed in the umount operation after running ->kill_sb. This is another "use after free" type of issue, but now related to the allocated super_block instance. To avoid the problem (warning at ida_remove and potential crashes) for this specific case, I added two functions in devpts which grabs additional references to the super_block, which pty code now uses so it makes sure the super block structure is still valid until pty shutdown is done. I also moved the additional inode references to the same functions, which also covered similar case with inode being freed before /dev/tty final close/shutdown. Signed-off-by: NHerton R. Krzesinski <herton@redhat.com> Cc: stable@vger.kernel.org # 2.6.29+ Reviewed-by: NPeter Hurley <peter@hurleysoftware.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 06 2月, 2016 4 次提交
-
-
由 Jason Baron 提交于
In the current implementation of the EPOLLEXCLUSIVE flag (added for 4.5-rc1), if epoll waiters create different POLL* sets and register them as exclusive against the same target fd, the current implementation will stop waking any further waiters once it finds the first idle waiter. This means that waiters could miss wakeups in certain cases. For example, when we wake up a pipe for reading we do: wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if one epoll set or epfd is added to pipe p with POLLIN and a second set epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the wakeup since the current implementation will stop after it finds any intersection of events with a waiter that is blocked in epoll_wait(). We could potentially address this by requiring all epoll waiters that are added to p be required to pass the same set of POLL* events. IE the first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* flags to be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I think it might be somewhat confusing interface as we would have to reference count the number of users for that set, and so userspace would have to keep track of that count, or we would need a more involved interface. It also adds some shared state that we'd have store somewhere. I don't think anybody will want to bloat __wait_queue_head for this. I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE such that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit the first idle waiter that specifies the EPOLLIN bit, since any remaining waiters that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least one example of this I saw in fs/pipe.c), then we just wake the entire exclusive list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any performance critical path, so I think that's ok (in fs/pipe.c its in pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus, the user can specify EPOLLERR and/or EPOLLHUP but is not required to do so. Since epoll waiters may be interested in other events as well besides EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 'dup' call on the target fd and adding that as one normally would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT events are what we are interest in balancing, I think that the 'dup' thing could perhaps be added to only one of the waiter threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be sufficient for the majority of use-cases. Since EPOLLEXCLUSIVE is intended to be used with a target fd shared among multiple epfds, where between 1 and n of the epfds may receive an event, it does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would get an event. Thus, it is not allowed to be specified in conjunction with EPOLLEXCLUSIVE. EPOLL_CTL_MOD is also not allowed if the fd was previously added as EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as interesting, but this could be relaxed at some further point. Signed-off-by: NJason Baron <jbaron@akamai.com> Tested-by: NMadars Vitolins <m@silodev.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dmitry Monakhov 提交于
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 xuejiufei 提交于
When recovery master down, dlm_do_local_recovery_cleanup() only remove the $RECOVERY lock owned by dead node, but do not clear the refmap bit. Which will make umount thread falling in dead loop migrating $RECOVERY to the dead node. Signed-off-by: Nxuejiufei <xuejiufei@huawei.com> Reviewed-by: NJoseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ross Zwisler 提交于
Previously the pfn_mkwrite() fault handler for raw block devices called bldev_dax_fault() -> __dax_fault() to do a full DAX page fault. Really what the pfn_mkwrite() fault handler needs to do is call dax_pfn_mkwrite() to make sure that the radix tree entry for the given PTE is marked as dirty so that a follow-up fsync or msync call will flush it durably to media. Fixes: 5a023cdb ("block: enable dax for raw block devices") Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 2月, 2016 3 次提交
-
-
由 Filipe Manana 提交于
While doing some tests I ran into an hang on an extent buffer's rwlock that produced the following trace: [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166] [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165] [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800016] irq event stamp: 0 [39389.800016] hardirqs last enabled at (0): [< (null)>] (null) [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last disabled at (0): [< (null)>] (null) [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000 [39389.800016] RIP: 0010:[<ffffffff810902af>] [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158 [39389.800016] RSP: 0018:ffff8800a185fb80 EFLAGS: 00000202 [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101 [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001 [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000 [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98 [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40 [39389.800016] FS: 00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000 [39389.800016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800016] Stack: [39389.800016] ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0 [39389.800016] ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895 [39389.800016] ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c [39389.800016] Call Trace: [39389.800016] [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60 [39389.800016] [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41 [39389.800016] [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44 [39389.800016] [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs] [39389.800016] [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs] [39389.800016] [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs] [39389.800016] [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs] [39389.800016] [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs] [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15 [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800016] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800016] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800016] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800016] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8 [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800012] irq event stamp: 0 [39389.800012] hardirqs last enabled at (0): [< (null)>] (null) [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last disabled at (0): [< (null)>] (null) [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G L 4.4.0-rc6-btrfs-next-18+ #1 [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000 [39389.800012] RIP: 0010:[<ffffffff81091e8d>] [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72 [39389.800012] RSP: 0018:ffff880034a639f0 EFLAGS: 00000206 [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000 [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000 [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98 [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00 [39389.800012] FS: 00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000 [39389.800012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800012] Stack: [39389.800012] ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98 [39389.800012] ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00 [39389.800012] ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58 [39389.800012] Call Trace: [39389.800012] [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c [39389.800012] [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41 [39389.800012] [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs] [39389.800012] [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs] [39389.800012] [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs] [39389.800012] [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs] [39389.800012] [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28 [39389.800012] [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs] [39389.800012] [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf [39389.800012] [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc [39389.800012] [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs] [39389.800012] [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs] [39389.800012] [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44 [39389.800012] [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs] [39389.800012] [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs] [39389.800012] [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs] [39389.800012] [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs] [39389.800012] [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs] [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800012] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800012] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800012] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800012] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00 This happens because in the code path executed by the inode_paths ioctl we end up nesting two calls to read lock a leaf's rwlock when after the first call to read_lock() and before the second call to read_lock(), another task (running the delayed items as part of a transaction commit) has already called write_lock() against the leaf's rwlock. This situation is illustrated by the following diagram: Task A Task B btrfs_ref_to_path() btrfs_commit_transaction() read_lock(&eb->lock); btrfs_run_delayed_items() __btrfs_commit_inode_delayed_items() __btrfs_update_delayed_inode() btrfs_lookup_inode() write_lock(&eb->lock); --> task waits for lock read_lock(&eb->lock); --> makes this task hang forever (and task B too of course) So fix this by avoiding doing the nested read lock, which is easily avoidable. This issue does not happen if task B calls write_lock() after task A does the second call to read_lock(), however there does not seem to exist anything in the documentation that mentions what is the expected behaviour for recursive locking of rwlocks (leaving the idea that doing so is not a good usage of rwlocks). Also, as a side effect necessary for this fix, make sure we do not needlessly read lock extent buffers when the input path has skip_locking set (used when called from send). Cc: stable@vger.kernel.org Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
由 Yan, Zheng 提交于
Signed-off-by: NYan, Zheng <zyan@redhat.com>
-
由 Dan Carpenter 提交于
ceph_osdc_alloc_request() returns NULL on error, it never returns error pointers. Fixes: 5be0389d ('ceph: re-send AIO write request when getting -EOLDSNAP error') Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-