- 13 January 2017, 2 commits
-
-
Committed by Geng, Jichao
For the no-snapshot case, we should use ci->truncate_{seq,size}. Fixes: 5f743e45 ("ceph: record truncate size/seq for snap data writeback") Signed-off-by: Geng, Jichao <geng.jichao@h3c.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Committed by Yan, Zheng
We should apply the check after getting the initial mdsmap. Fixes: e9e427f0 ("ceph: check availability of mds cluster on mount") Link: http://tracker.ceph.com/issues/18161 Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
- 12 January 2017, 2 commits
-
-
Committed by Damien Le Moal
All block device data fields and functions returning a number of 512B sectors are by convention named xxx_sectors, while names in the form xxx_size are generally used for a number of bytes. The blk_queue_zone_size and bdev_zone_size functions were not following this convention, so rename them. No functional change is introduced by this patch. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Collapsed the two patches, they were nonsensically split and broke bisection. Signed-off-by: Jens Axboe <axboe@fb.com>
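To make the convention concrete, here is a small standalone sketch (not the kernel API; the helper name and zone size below are made up): a quantity named xxx_sectors counts 512-byte units, so the corresponding xxx_size in bytes is a shift by 9.

#include <stdint.h>
#include <stdio.h>

#define SECTOR_SHIFT 9                      /* 512B sectors */

/* zone_sectors follows the xxx_sectors convention: a count of 512B sectors */
static uint64_t zone_size_bytes(uint32_t zone_sectors)
{
        /* the xxx_size spelling is reserved for byte counts */
        return (uint64_t)zone_sectors << SECTOR_SHIFT;
}

int main(void)
{
        uint32_t zone_sectors = 524288;     /* a 256 MiB zone, for example */
        printf("%llu bytes\n", (unsigned long long)zone_size_bytes(zone_sectors));
        return 0;
}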
-
Committed by Jan Kara
Commit 99579cce "xfs: skip dirty pages in ->releasepage()" started to skip dirty pages in xfs_vm_releasepage(), which also has the effect that if a dirty page is truncated, it does not get freed by block_invalidatepage() and lingers in the LRU list waiting for reclaim. So a simple loop like: while true; do dd if=/dev/zero of=file bs=1M count=100; rm file; done will keep using more and more memory until we hit the low watermarks and start pagecache reclaim, which will eventually reclaim the truncated pages as well. Keeping these truncated (and thus never usable) pages in memory is just a waste of memory, unnecessarily stresses page cache reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM prematurely. So instead of just skipping dirty pages in xfs_vm_releasepage(), return to the old behavior of skipping them only if they have delalloc or unwritten buffers, and fix the spurious warnings by warning only if the page is clean. CC: stable@vger.kernel.org CC: Brian Foster <bfoster@redhat.com> CC: Vlastimil Babka <vbabka@suse.cz> Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz> Fixes: 99579cce Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
- 11 January 2017, 3 commits
-
-
Committed by Eric Ren
The crash happens rather often when we reset some cluster nodes while nodes contend fiercely to do truncate and append. The crash backtrace is below: dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18) ocfs2: End replay journal (node 318952601, slot 2) on device (253,18) ocfs2: Beginning quota recovery on device (253,18) for slot 2 ocfs2: Finishing quota recovery on device (253,18) for slot 2 (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode) (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1 ------------[ cut here ]------------ kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 Supported: No, Unsupported modules are loaded CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014 task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000 RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282 RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246 RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414 R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448 R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020 FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0 Call Trace: ocfs2_setattr+0x698/0xa90 [ocfs2] notify_change+0x1ae/0x380 do_truncate+0x5e/0x90 do_sys_ftruncate.constprop.11+0x108/0x160 entry_SYSCALL_64_fastpath+0x12/0x6d Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] It's because ocfs2_inode_lock() get us stale LVB in which the i_size is not equal to the disk i_size. We mistakenly trust the LVB because the underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with DLM_SBF_VALNOTVALID properly for us. But, why? The current code tries to downconvert lock without DLM_LKF_VALBLK flag to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even if the lock resource type needs LVB. This is not the right way for fsdlm. 
The fsdlm plugin behaves differently on DLM_LKF_VALBLK: it depends on DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If DLM_LKF_VALBLK is not set, fsdlm will skip recovering the RSB's LVB from this lkb and set DLM_SBF_VALNOTVALID appropriately when node failure happens. The following diagram briefly illustrates how this crash happens: RSB1 is an inode metadata lock resource with LOCK_TYPE_USES_LVB; The 1st round: Node1 Node2 RSB1: PR RSB1(master): NULL->EX ocfs2_downconvert_lock(PR->NULL, set_lvb==0) ocfs2_dlm_lock(no DLM_LKF_VALBLK) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - dlm_lock(no DLM_LKF_VALBLK) convert_lock(overwrite lkb->lkb_exflags with no DLM_LKF_VALBLK) RSB1: NULL RSB1: EX reset Node2 dlm_recover_rsbs() recover_lvb() /* The LVB is not trustable if the node with EX fails and * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1. */ if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to return; * to invalidate the LVB here. */ The 2nd round: Node1 Node2 RSB1(become master from recovery) ocfs2_setattr() ocfs2_inode_lock(NULL->EX) /* dlm_lock() returns the stale lvb without setting DLM_SBF_VALNOTVALID */ ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */ ocfs2_truncate_file() mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */ The fix is quite straightforward: we keep the DLM_LKF_VALBLK flag set for dlm_lock() if the lock resource type needs the LVB and the fsdlm plugin is used. Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ross Zwisler
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pmd_t of a DAX PMD entry during an *sync operation. This can result in data loss in the following sequence: 1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the pmd_t dirty and writeable 2) fsync, flushing out PMD data and cleaning the radix tree entry. We currently fail to mark the pmd_t as clean and write protected. 3) more mmap writes to the PMD. These don't cause any page faults since the pmd_t is dirty and writeable. The radix tree entry remains clean. 4) fsync, which fails to flush the dirty PMD data because the radix tree entry was clean. 5) crash - dirty data that should have been fsync'd as part of 4) could still have been in the processor cache, and is lost. Fix this by marking the pmd_t clean and write protected in dax_mapping_entry_mkclean(), which is called as part of the fsync operation 2). This will cause the writes in step 3) above to generate page faults where we'll re-dirty the PMD radix tree entry, resulting in flushes in the fsync that happens in step 4). Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush") Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
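A userspace sketch of the failing sequence in steps 1)-4) above, assuming a file on a DAX-mounted filesystem backed by a 2 MiB, PMD-aligned extent; the path is a placeholder, and the crash of step 5) is of course not reproduced here.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20;                          /* one 2 MiB PMD      */
        int fd = open("/mnt/dax/file", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0xaa, len);   /* 1) mmap write: pmd_t and radix entry dirty   */
        fsync(fd);              /* 2) flush; entry cleaned, pmd_t left dirty    */
        memset(p, 0xbb, len);   /* 3) no fault, so the radix entry stays clean  */
        fsync(fd);              /* 4) sees a clean entry and flushes nothing    */
        return 0;
}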
-
Committed by Chandan Rajendra
The code currently uses sdio->blkbits to compute the number of blocks to be cleaned. However, sdio->blkbits is derived from the logical block size of the underlying block device (refer to the definition of do_blockdev_direct_IO()). Due to this, the generic/299 test would occasionally fail when executed on an ext4 filesystem with 64k as the block size and when using a virtio based disk (having 512 byte as the logical block size) inside a kvm guest. This commit fixes the bug by using inode->i_blkbits to compute the number of blocks to be cleaned. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixed up by Jeff Moyer to only use/evaluate inode->i_blkbits once, to avoid issues with block size changes with IO in flight. Signed-off-by: Jens Axboe <axboe@fb.com>
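A toy illustration of the unit mismatch (the sizes are only examples): shifting a byte count by the device's logical-block bits instead of the filesystem's block bits produces a block count that is far too large whenever the two differ.

#include <stdio.h>

int main(void)
{
        unsigned dev_blkbits = 9;      /* 512B logical block, as sdio->blkbits would be */
        unsigned fs_blkbits = 16;      /* 64KiB fs block, as inode->i_blkbits would be  */
        unsigned long long bytes = 1ULL << 20;             /* 1 MiB of data */

        printf("blocks by device blkbits: %llu\n", bytes >> dev_blkbits); /* 2048 */
        printf("blocks by fs blkbits:     %llu\n", bytes >> fs_blkbits);  /* 16   */
        return 0;
}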
-
- 10 January 2017, 5 commits
-
-
Committed by Christoph Hellwig
There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail: one is an I/O error, for which xlog_bdstrat already logs a warning, and the second is an already shutdown log due to previous I/O errors. In the latter case we'll already have a previous indication of the actual error, but the large stream of misleading warnings from xfs_log_force will probably scroll it out of the message buffer. Simply removing the warnings thus makes the XFS log reporting significantly better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Christoph Hellwig
->total is a bit of an odd parameter passed down to the low-level allocator all the way from the high-level callers. It's supposed to contain the maximum number of blocks to be allocated for the whole transaction [1]. But in xfs_iomap_write_allocate we only convert existing delayed allocations and thus only have a minimal block reservation for the current transaction, so xfs_alloc_space_available can't use it for the allocation decisions. Use the maximum of args->total and the calculated block requirement to make a decision. We probably should get rid of args->total eventually and instead apply ->minleft more broadly, but that will require some extensive changes all over. [1] which creates lots of confusion as most callers don't decrement it once doing a first allocation. But that's for a separate series. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Christoph Hellwig
We must decide in xfs_alloc_fix_freelist whether an allocation from a given AG is possible based on the available space, and should not fail the allocation past that point on a healthy file system. But currently we have two additional places that second-guess xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the maxlen parameter to remove the reservation before doing the allocation (but ignores the various minimum freespace requirements), and xfs_alloc_fix_minleft tries to fix up the allocated length after we've found an extent, but ignores the reservations and also doesn't take the AGFL into account (and thus fails allocations for not matching minlen in some cases). Remove all these later fixups and just correct the maxlen argument inside xfs_alloc_fix_freelist once we have the AGF buffer locked. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Christoph Hellwig
We can't just set minleft to 0 when we're low on space - that's exactly what we need minleft for: to protect space in the AG for btree block allocations when we are low on free space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Christoph Hellwig
Setting aside 4 blocks globally for bmbt splits isn't all that useful, as different threads can allocate space in parallel. Bump it to 4 blocks per AG to allow each thread that is currently doing an allocation to dip into it separately. Without that we may not have enough reserved blocks if there are enough parallel transactions in an almost out-of-space file system that all run into bmap btree splits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
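A back-of-the-envelope illustration of the sizing change (all numbers are made up): the reserve now scales with the number of AGs instead of being one shared pool, so concurrent allocators no longer compete for the same 4 blocks.

#include <stdio.h>

int main(void)
{
        unsigned agcount = 16;             /* example allocation group count */
        unsigned per_ag = 4;               /* blocks reserved in each AG     */

        printf("old global reserve: %u blocks\n", 4u);
        printf("new total reserve:  %u blocks\n", agcount * per_ag);  /* 64 */
        return 0;
}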
-
- 09 January 2017, 2 commits
-
-
Committed by Liu Bo
'inode' is an important field for btrfs_get_extent, let's trace it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by David Sterba
Enabling btrfs tracepoints leads to an instant crash, as reported. The wq callbacks could free the memory and the tracepoints started to dereference the members to get to fs_info. The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2 removed the tracepoints, but we could preserve them by passing only the required data in a safe way. Fixes: bc074524 ("btrfs: prefix fsid to all trace events") CC: stable@vger.kernel.org # 4.8+ Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 04 January 2017, 5 commits
-
-
Committed by Carlos Maiolino
The max_retries _show and _store functions should test against cfg->max_retries, not cfg->retry_timeout. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Christoph Hellwig
There is a race window between write_cache_pages calling clear_page_dirty_for_io and XFS calling set_page_writeback, in which the mapping for an inode is tagged neither as dirty, nor as writeback. If the COW shrinker hits in exactly that window we'll remove the delayed COW extents while writepages is trying to write them back, which in release kernels will manifest as corruption of the bmap btree, and in debug kernels will trip the ASSERT about now calling xfs_bmapi_write with the COWFORK flag for holes. A complex customer load manages to hit this window fairly reliably, probably by always having COW writeback in flight while the cow shrinker runs. This patch adds another check for having the I_DIRTY_PAGES flag set, which is still set during this race window. While this fixes the problem I'm still not overly happy about the way the COW shrinker works as it still seems a bit fragile. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Darrick J. Wong
We need to use the actual AG length when making per-AG reservations, since we could otherwise end up reserving more blocks out of the last AG than there are actual blocks. Complained-about-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Committed by Darrick J. Wong
Dan Carpenter reported a double-free of rcur if _defer_finish fails while we're recovering CUI items. Fix the error recovery to prevent this. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Liu Bo
Currently how btrfs dio deals with a split dio write is not good enough: if a dio write is split into several segments due to the lack of contiguous space, a large dio write like 'dd bs=1G count=1' can end up with an incorrect outstanding_extents counter, and endio would complain loudly with an assertion. This fixes the problem by compensating the outstanding_extents counter in the inode if a large dio write gets split. Reported-by: Anand Jain <anand.jain@oracle.com> Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 03 January 2017, 5 commits
-
-
Committed by Liu Bo
While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a different inode's log_mutex while holding the current inode's log_mutex, and lockdep has complained about this with a possible deadlock warning. Fix this by using mutex_lock_nested() when processing the other inode's log_mutex. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
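A hedged, kernel-context sketch of the pattern the fix relies on (the struct and function below are hypothetical, not the btrfs code): when a second lock of the same lock class must be taken while the first is held, mutex_lock_nested() tells lockdep that the nesting is intentional rather than a self-deadlock.

#include <linux/lockdep.h>
#include <linux/mutex.h>

struct demo_inode {
        struct mutex log_mutex;
};

static void log_both(struct demo_inode *cur, struct demo_inode *other)
{
        mutex_lock(&cur->log_mutex);
        /* second lock of the same class: annotate it with a subclass for lockdep */
        mutex_lock_nested(&other->log_mutex, SINGLE_DEPTH_NESTING);
        /* ... handle the INODE_REF/INODE_EXTREF corner case ... */
        mutex_unlock(&other->log_mutex);
        mutex_unlock(&cur->log_mutex);
}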
-
Committed by Liu Bo
If @block_group is not @used_bg, it'll try to get @used_bg's lock without dropping @block_group's lock, and lockdep has thrown a scary deadlock warning about it. Fix it by using down_read_nested. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Jeff Mahoney
In __btrfs_run_delayed_refs, when we put back a delayed ref that's too new, we have already dropped the lock on locked_ref when we set ->processing = 0. This patch keeps the lock to cover that assignment. Fixes: d7df2c79 (Btrfs: attach delayed ref updates to delayed ref heads) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Jeff Mahoney
In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op fails sets locked_ref->processing = 0 but doesn't re-increment delayed_refs->num_heads_ready. As a result, we end up triggering the WARN_ON in btrfs_select_ref_head. Fixes: d7df2c79 (Btrfs: attach delayed ref updates to delayed ref heads) Reported-by: Jon Nelson <jnelson-suse@jamponi.net> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Chandan Rajendra
The first block to be cleaned may start at a non-zero page offset. In such a scenario clean_bdev_aliases() will end up cleaning blocks that do not fall in the range of blocks to be cleaned. This commit fixes the issue by skipping blocks that do not fall in the valid block range. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
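A worked example of the range check (block and page sizes are illustrative): with 1 KiB blocks and 4 KiB pages, the page at index 1 covers blocks 4-7, so if the range to clean is blocks 6-9, only blocks 6 and 7 on that page may be touched.

#include <stdio.h>

int main(void)
{
        unsigned blkbits = 10, pagebits = 12;               /* 1KiB blocks, 4KiB pages */
        unsigned long long clean_start = 6, clean_end = 9;  /* blocks to clean          */
        unsigned long long page_index = 1;
        unsigned long long first = page_index << (pagebits - blkbits);   /* block 4    */
        unsigned per_page = 1u << (pagebits - blkbits);                   /* 4 blocks   */

        for (unsigned i = 0; i < per_page; i++) {
                unsigned long long blk = first + i;
                if (blk < clean_start || blk > clean_end)
                        continue;                    /* skip blocks outside the range */
                printf("clean block %llu\n", blk);   /* prints 6 and 7 */
        }
        return 0;
}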
-
- 31 December 2016, 1 commit
-
-
Committed by Eric Biggers
Attempting to link a device node, named pipe, or socket file into an encrypted directory through rename(2) or link(2) always failed with EPERM. This happened because fscrypt_has_permitted_context() saw that the file was unencrypted and forbade creating the link. This behavior was unexpected because such files are never encrypted; only regular files, directories, and symlinks can be encrypted. To fix this, make fscrypt_has_permitted_context() always return true on special files. This will be covered by a test in my encryption xfstests patchset. Fixes: 9bd8212f ("ext4 crypto: add encryption policy and password salt support") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Richard Weinberger <richard@nod.at> Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
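A minimal userspace sketch of the intended behaviour, using a hypothetical stand-in for fscrypt_has_permitted_context(): special files carry no encryption policy, so the permitted-context check should simply allow them.

#include <stdbool.h>
#include <sys/stat.h>

/* hypothetical stand-in for fscrypt_has_permitted_context() */
static bool has_permitted_context(mode_t child_mode, bool policies_match)
{
        if (!S_ISREG(child_mode) && !S_ISDIR(child_mode) && !S_ISLNK(child_mode))
                return true;     /* device nodes, FIFOs, sockets: never encrypted */
        return policies_match;   /* otherwise compare the two encryption policies */
}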
-
- 28 December 2016, 1 commit
-
-
Committed by Theodore Ts'o
Commit f1c131b4 ("crypto: xts - Convert to skcipher") now fails the setkey operation if the AES key is the same as the tweak key. Previously this check was only done if FIPS mode was enabled; now the check is also done if weak key checking was requested. This is reasonable, but since we were using a dummy key which was a constant series of 0x42 bytes, it now caused the dummy encryption test mode to fail. Fix this by using 0x42... and 0x24... for the two keys, so they are different. Fixes: f1c131b4 Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
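A small sketch of the resulting dummy key layout; only the 0x42/0x24 byte values come from the commit text, while the size macro and function name here are made up.

#include <string.h>

#define DUMMY_XTS_KEY_SIZE 64          /* e.g. AES-256-XTS: two 32-byte halves */

static void fill_dummy_xts_key(unsigned char key[DUMMY_XTS_KEY_SIZE])
{
        memset(key, 0x42, DUMMY_XTS_KEY_SIZE / 2);               /* AES key   */
        memset(key + DUMMY_XTS_KEY_SIZE / 2, 0x24,               /* tweak key */
               DUMMY_XTS_KEY_SIZE / 2);
}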
-
- 27 December 2016, 6 commits
-
-
Committed by Jan Kara
Now that dax_iomap_fault() calls ->iomap_begin() without the entry lock, we can use transaction starting in ext4_iomap_begin() and thus simplify ext4_dax_fault(). It also provides us with proper retries in case of ENOSPC. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
Committed by Jan Kara
Currently the ->iomap_begin() handler is called with the entry lock held. If the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such as ext4, which will want to hold a transaction open), this would cause lock inversion with iomap_apply() from the standard IO path, which first calls ->iomap_begin() and only then calls the ->actor() callback which grabs entry locks for DAX (if it faults when copying from/to user provided buffers). Fix the problem by nesting the grabbing of the entry lock inside the ->iomap_begin() - ->iomap_end() pair. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
Committed by Jan Kara
The only case when we do not finish the page fault completely is when we are loading hole pages into a radix tree. Avoid this special case and finish the fault in that case as well inside the DAX fault handler. This will allow for easier iomap handling. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
Committed by Jan Kara
Currently dax_iomap_rw() takes care of invalidating page tables and evicting hole pages from the radix tree when a write(2) to the file happens. This invalidation is only necessary when there is some block allocation resulting from the write(2). Furthermore, in its current place the invalidation is racy with respect to a page fault instantiating a hole page just after we have invalidated it. So perform the page invalidation inside dax_iomap_actor(), where we can do it only when really necessary and after blocks have been allocated, so nobody will be instantiating new hole pages anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
Committed by Jan Kara
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages() just delete all exceptional radix tree entries they find. For DAX this is not desirable, as we track cache dirtiness in these entries and when they are evicted we may not flush caches although it is necessary. This can for example manifest when we write to the same block both via mmap and via write(2) (to different offsets) and fsync(2) then does not properly flush CPU caches when the modification via write(2) was the last one. Create appropriate DAX functions to handle invalidation of DAX entries for invalidate_inode_pages2_range() and invalidate_mapping_pages() and wire them up into the corresponding mm functions. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
Committed by Jan Kara
So far we did not return BH_New buffers from ext2_get_blocks() when we allocated and zeroed-out a block for a DAX inode, to avoid racy zeroing in the DAX code. This zeroing is gone these days, so we can remove the workaround. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
- 26 December 2016, 3 commits
-
-
Committed by Thomas Gleixner
No point in going through loops and hoops instead of just comparing the values. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
-
Committed by Thomas Gleixner
ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a seconds and nanoseconds part of a time value needs to be converted. For anything where the seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
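A kernel-context before/after sketch of the simplification (the 10 ms value is arbitrary, and the direct assignment assumes ktime_t is a plain scalar count of nanoseconds, which the following commit completes):

#include <linux/ktime.h>

static ktime_t example_timeout_old(void)
{
        return ktime_set(0, 10 * NSEC_PER_MSEC);  /* before: seconds part is 0 */
}

static ktime_t example_timeout_new(void)
{
        return 10 * NSEC_PER_MSEC;                /* after: simple assignment  */
}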
-
Committed by Thomas Gleixner
ktime is a union because the initial implementation stored the time in scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec variant for 32-bit machines. The Y2038 cleanup removed the timespec variant and switched everything to scalar nanoseconds. The union remained, but became completely pointless. Get rid of the union and just keep ktime_t as a simple typedef of type s64. The conversion was done with coccinelle and some manual mopping up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
-
- 25 December 2016, 1 commit
-
-
Committed by Linus Torvalds
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 December 2016, 1 commit
-
-
Committed by Jan Kara
There are only two call sites of fsnotify_duplicate_mark(). Those are in kernel/audit_tree.c and both are bogus. The vfsmount pointer is unused for the audit tree, the inode pointer and group get set in fsnotify_add_mark_locked() later anyway, and mask and free_mark are already set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is actively harmful because the following fsnotify_add_mark_locked() will leak a group reference by overwriting the group pointer. So just remove the two calls to fsnotify_duplicate_mark() and the function. Signed-off-by: Jan Kara <jack@suse.cz> [PM: line wrapping to fit in 80 chars] Signed-off-by: Paul Moore <paul@paul-moore.com>
-
- 23 December 2016, 3 commits
-
-
Committed by Jeff Layton
sparse says: fs/ufs/inode.c:1195:6: warning: symbol 'ufs_truncate_blocks' was not declared. Should it be static? Note that the forward declaration in the file is already marked static. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Committed by Aleksa Sarai
If you have a process that has set itself to be non-dumpable, and it then undergoes exec(2), any CLOEXEC file descriptors it has open are "exposed" during a race window between the dumpable flags of the process being reset for exec(2) and CLOEXEC being applied to the file descriptors. This can be exploited by a process by attempting to access /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE. The race in question is after set_dumpable has been (for get_link, though the trace is basically the same for readlink): [vfs] -> proc_pid_link_inode_operations.get_link -> proc_pid_get_link -> proc_fd_access_allowed -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS); which will return 0 during the race window, and CLOEXEC file descriptors will still be open during this window because do_close_on_exec has not been called yet. As a result, the ordering of these calls should be reversed to avoid this race window. This is of particular concern to container runtimes, where joining a PID namespace with file descriptors referring to the host filesystem can result in security issues (since PR_SET_DUMPABLE doesn't protect against access of CLOEXEC file descriptors -- file descriptors which may reference filesystem objects the container shouldn't have access to). Cc: dev@opencontainers.org Cc: <stable@vger.kernel.org> # v3.2+ Reported-by: Michael Crosby <crosbymichael@gmail.com> Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
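A userspace sketch of how the window can be observed (the pid and fd number are placeholders): a sibling process polls the /proc link, and during the window described above the readlink succeeds even though the descriptor is marked CLOEXEC and its owner is non-dumpable.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char path[64], target[256];
        ssize_t n;

        snprintf(path, sizeof(path), "/proc/%d/fd/%d", 1234, 3);  /* placeholders */
        for (;;) {
                n = readlink(path, target, sizeof(target) - 1);
                if (n > 0) {
                        target[n] = '\0';
                        printf("window open: fd resolves to %s\n", target);
                        break;
                }
        }
        return 0;
}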
-
Committed by Tomasz Majchrzak
If a kernfs file is empty on a first read, successive read operations using the same file descriptor will return no data, even when data is available. The default kernfs 'seq_next' implementation advances the iterator position even when the next object is not there. Kernfs 'seq_start' for following requests will not return an iterator, as the position is already on the second object. This defect makes it impossible to monitor the badblocks sysfs files from MD raid. They are initially empty, but if data appears at some stage, userspace is not able to read it. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-