- 03 1月, 2019 1 次提交
-
-
由 NeilBrown 提交于
After moving all requests from fl->fl_blocked_requests to new->fl_blocked_requests it is nonsensical to do anything to all the remaining elements, there aren't any. This should do something to all the requests that have been moved. For simplicity, it does it to all requests in the target list. Setting "f->fl_blocker = new" to all members of new->fl_blocked_requests is "obviously correct" as it preserves the invariant of the linkage among requests. Reported-by: syzbot+239d99847eb49ecb3899@syzkaller.appspotmail.com Fixes: 5946c431 ("fs/locks: allow a lock request to block other requests.") Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJeff Layton <jlayton@kernel.org>
-
- 29 12月, 2018 18 次提交
-
-
由 Mike Kravetz 提交于
hugetlbfs page faults can race with truncate and hole punch operations. Current code in the page fault path attempts to handle this by 'backing out' operations if we encounter the race. One obvious omission in the current code is removing a page newly added to the page cache. This is pretty straight forward to address, but there is a more subtle and difficult issue of backing out hugetlb reservations. To handle this correctly, the 'reservation state' before page allocation needs to be noted so that it can be properly backed out. There are four distinct possibilities for reservation state: shared/reserved, shared/no-resv, private/reserved and private/no-resv. Backing out a reservation may require memory allocation which could fail so that needs to be taken into account as well. Instead of writing the required complicated code for this rare occurrence, just eliminate the race. i_mmap_rwsem is now held in read mode for the duration of page fault processing. Hold i_mmap_rwsem longer in truncation and hold punch code to cover the call to remove_inode_hugepages. With this modification, code in remove_inode_hugepages checking for races becomes 'dead' as it can not longer happen. Remove the dead code and expand comments to explain reasoning. Similarly, checks for races with truncation in the page fault path can be simplified and removed. [mike.kravetz@oracle.com: incorporat suggestions from Kirill] Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com Fixes: ebed4bfc ("hugetlb: fix absurd HugePages_Rsvd") Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
All callers of migrate_page_move_mapping() now pass NULL for 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Kara 提交于
Currently, block device pages don't provide a ->migratepage callback and thus fallback_migrate_page() is used for them. This handler cannot deal with dirty pages in async mode and also with the case a buffer head is in the LRU buffer head cache (as it has elevated b_count). Thus such page can block memory offlining. Fix the problem by using buffer_migrate_page_norefs() for migrating block device pages. That function takes care of dropping bh LRU in case migration would fail due to elevated buffer refcount to avoid stalls and can also migrate dirty pages without writing them. Link: http://lkml.kernel.org/r/20181211172143.7358-6-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
When the process being tracked does mremap() without UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle, we should not generate the remap event, and at the same time we should clear all the uffd flags on the new VMA. Without this patch, we can still have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even the fault handling process does not even know the existance of the VMA. Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Pravin Shedge <pravin.shedge4linux@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
David Rientjes has reported that commit 18600332 ("mm: make PR_SET_THP_DISABLE immediately active") has changed the way how we report THPable VMAs to the userspace. Their monitoring tool is triggering false alarms on PR_SET_THP_DISABLE tasks because it considers an insufficient THP usage as a memory fragmentation resp. memory pressure issue. Before the said commit each newly created VMA inherited VM_NOHUGEPAGE flag and that got exposed to the userspace via /proc/<pid>/smaps file. This implementation had its downsides as explained in the commit message but it is true that the userspace doesn't have any means to query for the process wide THP enabled/disabled status. PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to export in the process wide context rather than per-vma. Introduce a new field to /proc/<pid>/status which export this status. If PR_SET_THP_DISABLE is used then it reports false same as when the THP is not compiled in. It doesn't consider the global THP status because we already export that information via sysfs Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org Fixes: 18600332 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reported-by: NDavid Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jérôme Glisse 提交于
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Acked-by: NChristian König <christian.koenig@amd.com> Acked-by: NJan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Anthony Yznaga 提交于
Certain pages that are never mapped to userspace have a type indicated in the page_type field of their struct pages (e.g. PG_buddy). page_type overlaps with _mapcount so set the count to 0 and avoid calling page_mapcount() for these pages. [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox] Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.comSigned-off-by: NAnthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMatthew Wilcox <willy@infradead.org> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric Biggers 提交于
Reference counters should use refcount_t rather than atomic_t, since the refcount_t implementation can prevent overflows, reducing the exploitability of reference leak bugs. userfaultfd_ctx::refcount is a reference counter with the usual semantics, so convert it to refcount_t. Note: I replaced the BUG() on incrementing a 0 refcount with just refcount_inc(), since part of the semantics of refcount_t is that that incrementing a 0 refcount is not allowed; with CONFIG_REFCOUNT_FULL, refcount_inc() already checks for it and warns. Link: http://lkml.kernel.org/r/20181115003916.63381-1-ebiggers@kernel.orgSigned-off-by: NEric Biggers <ebiggers@google.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: NMike Rapoport <rppt@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arun KS 提交于
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.orgSigned-off-by: NArun KS <arunks@codeaurora.org> Suggested-by: NMichal Hocko <mhocko@suse.com> Suggested-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arun KS 提交于
Patch series "mm: convert totalram_pages, totalhigh_pages and managed pages to atomic", v5. This series converts totalram_pages, totalhigh_pages and zone->managed_pages to atomic variables. totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better to remove the lock and convert variables to atomic. With the change, preventing poteintial store-to-read tearing comes as a bonus. This patch (of 4): This is in preparation to a later patch which converts totalram_pages and zone->managed_pages to atomic variables. Please note that re-reading the value might lead to a different value and as such it could lead to unexpected behavior. There are no known bugs as a result of the current code but it is better to prevent from them in principle. Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.orgSigned-off-by: NArun KS <arunks@codeaurora.org> Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Junxiao Bi 提交于
For sync io read in ocfs2_read_blocks_sync(), first clear bh uptodate flag and submit the io, second wait io done, last check whether bh uptodate, if not return io error. If two sync io for the same bh were issued, it could be the first io done and set uptodate flag, but just before check that flag, the second io came in and cleared uptodate, then ocfs2_read_blocks_sync() for the first io will return IO error. Indeed it's not necessary to clear uptodate flag, as the io end handler end_buffer_read_sync() will set or clear it based on io succeed or failed. The following message was found from a nfs server but the underlying storage returned no error. [4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5 [4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5 [4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5 [4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5 [4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5 Same issue in non sync read ocfs2_read_blocks(), fixed it as well. Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.comSigned-off-by: NJunxiao Bi <junxiao.bi@oracle.com> Reviewed-by: NChangwei Ge <ge.changwei@h3c.com> Reviewed-by: NYiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Junxiao Bi 提交于
Dirty flag of the journal should be cleared at the last stage of umount, if do it before jbd2_journal_destroy(), then some metadata in uncommitted transaction could be lost due to io error, but as dirty flag of journal was already cleared, we can't find that until run a full fsck. This may cause system panic or other corruption. Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.comSigned-off-by: NJunxiao Bi <junxiao.bi@oracle.com> Reviewed-by: NYiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: NJoseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Junxiao Bi 提交于
mount.ocfs2 ignore the inconsistent error that journal is clean but local alloc is unrecovered. After mount, local alloc not empty, then reserver cluster didn't alloc a new local alloc window, reserveration map is empty(ocfs2_reservation_map.m_bitmap_len = 0), that triggered the following panic. This issue was reported at https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html and was advised to fixed during mount. But this is a very unusual inconsistent state, usually journal dirty flag should be cleared at the last stage of umount until every other things go right. We may need do further debug to check that. Any way to avoid possible futher corruption, mount should be abort and fsck should be run. (mount.ocfs2,1765,1):ocfs2_load_local_alloc:353 ERROR: Local alloc hasn't been recovered! found = 6518, set = 6518, taken = 8192, off = 15912372 ocfs2: Mounting device (202,64) on (node 0, slot 3) with ordered data mode. o2dlm: Joining domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 8 ) 8 nodes ocfs2: Mounting device (202,80) on (node 0, slot 3) with ordered data mode. o2hb: Region 89CEAC63CC4F4D03AC185B44E0EE0F3F (xvdf) is now a quorum device o2net: Accepted connection from node yvwsoa17p (num 7) at 172.22.77.88:7777 o2dlm: Node 7 joins domain 64FE421C8C984E6D96ED12C55FEE2435 ( 0 1 2 3 4 5 6 7 8 ) 9 nodes o2dlm: Node 7 joins domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 7 8 ) 9 nodes ------------[ cut here ]------------ kernel BUG at fs/ocfs2/reservations.c:507! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd grace ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ovmapi ppdev parport_pc parport xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea acpi_cpufreq pcspkr i2c_piix4 i2c_core sg ext4 jbd2 mbcache2 sr_mod cdrom xen_blkfront pata_acpi ata_generic ata_piix floppy dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 4349 Comm: startWebLogic.s Not tainted 4.1.12-124.19.2.el6uek.x86_64 #2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 09/06/2018 task: ffff8803fb04e200 ti: ffff8800ea4d8000 task.ti: ffff8800ea4d8000 RIP: 0010:[<ffffffffa05e96a8>] [<ffffffffa05e96a8>] __ocfs2_resv_find_window+0x498/0x760 [ocfs2] Call Trace: ocfs2_resmap_resv_bits+0x10d/0x400 [ocfs2] ocfs2_claim_local_alloc_bits+0xd0/0x640 [ocfs2] __ocfs2_claim_clusters+0x178/0x360 [ocfs2] ocfs2_claim_clusters+0x1f/0x30 [ocfs2] ocfs2_convert_inline_data_to_extents+0x634/0xa60 [ocfs2] ocfs2_write_begin_nolock+0x1c6/0x1da0 [ocfs2] ocfs2_write_begin+0x13e/0x230 [ocfs2] generic_perform_write+0xbf/0x1c0 __generic_file_write_iter+0x19c/0x1d0 ocfs2_file_write_iter+0x589/0x1360 [ocfs2] __vfs_write+0xb8/0x110 vfs_write+0xa9/0x1b0 SyS_write+0x46/0xb0 system_call_fastpath+0x18/0xd7 Code: ff ff 8b 75 b8 39 75 b0 8b 45 c8 89 45 98 0f 84 e5 fe ff ff 45 8b 74 24 18 41 8b 54 24 1c e9 56 fc ff ff 85 c0 0f 85 48 ff ff ff <0f> 0b 48 8b 05 cf c3 de ff 48 ba 00 00 00 00 00 00 00 10 48 85 RIP __ocfs2_resv_find_window+0x498/0x760 [ocfs2] RSP <ffff8800ea4db668> ---[ end trace 566f07529f2edf3c ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.comSigned-off-by: NJunxiao Bi <junxiao.bi@oracle.com> Reviewed-by: NYiwen Jiang <jiangyiwen@huawei.com> Acked-by: NJoseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Larry Chen 提交于
Included file path was hard-wired in the ocfs2 makefile, which might causes some confusion when compiling ocfs2 as an external module. Say if we compile ocfs2 module as following. cp -r /kernel/tree/fs/ocfs2 /other/dir/ocfs2 cd /other/dir/ocfs2 make -C /path/to/kernel_source M=`pwd` modules Acutally, the compiler wil try to find included file in /kernel/tree/fs/ocfs2, rather than the directory /other/dir/ocfs2. To fix this little bug, we introduce the var $(src) provided by kbuild. $(src) means the absolute path of the running kbuild file. Link: http://lkml.kernel.org/r/20181108085546.15149-1-lchen@suse.comSigned-off-by: NLarry Chen <lchen@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 zhong jiang 提交于
lastzero is not used after setting its value. It is safe to remove the unused variable. Link: http://lkml.kernel.org/r/1540296942-24533-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 zhong jiang 提交于
status is not used after setting its value. It is safe to remove the unused variable. Link: http://lkml.kernel.org/r/1540300179-26697-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Reviewed-by: NJun Piao <piaojun@huawei.com> Acked-by: NJoseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jia Guo 提交于
Reading heartbeat data from lowest node rather than from zero, in cases where the node is not defined from zero, can reduce the number of sectors read. Here is a simple test data obtained with 'iostat -dmx dm-5 2', with two nodes in the cluster, node number 10, 20, respectively. Before optimization: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-5 0.00 0.00 0.50 0.50 0.01 0.00 11.00 0.00 1.00 1.00 1.00 1.50 0.15 After the optimization: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-5 0.00 0.00 0.50 0.50 0.00 0.00 6.00 0.00 0.50 1.00 0.00 0.50 0.05 Link: http://lkml.kernel.org/r/99fe4988-69ac-3615-a218-3042fe6fbe72@huawei.comSigned-off-by: NJia Guo <guojia12@huawei.com> Reviewed-by: NJun Piao <piaojun@huawei.com> Reviewed-by: NYiwen Jiang <jiangyiwen@huawei.com> Acked-by: NJoseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 12月, 2018 14 次提交
-
-
由 Jaegeuk Kim 提交于
There is a security report where f2fs_getxattr() has a hole to expose wrong memory region when the image is malformed like this. f2fs_getxattr: entry->e_name_len: 4, size: 12288, buffer_size: 16384, len: 4 Cc: <stable@vger.kernel.org> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Sahitya Tummala 提交于
iput() on sbi->node_inode can update sbi->stat_info in the below context, if the f2fs_write_checkpoint() has failed with error. f2fs_balance_fs_bg+0x1ac/0x1ec f2fs_write_node_pages+0x4c/0x260 do_writepages+0x80/0xbc __writeback_single_inode+0xdc/0x4ac writeback_single_inode+0x9c/0x144 write_inode_now+0xc4/0xec iput+0x194/0x22c f2fs_put_super+0x11c/0x1e8 generic_shutdown_super+0x70/0xf4 kill_block_super+0x2c/0x5c kill_f2fs_super+0x44/0x50 deactivate_locked_super+0x60/0x8c deactivate_super+0x68/0x74 cleanup_mnt+0x40/0x78 Fix this by moving f2fs_destroy_stats() further below iput() in both f2fs_put_super() and f2fs_fill_super() paths. Signed-off-by: NSahitya Tummala <stummala@codeaurora.org> Reviewed-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
For all ordered cases in f2fs_wait_on_page_writeback(), we need to check PageWriteback status, so let's clean up to relocate the check into f2fs_wait_on_page_writeback(). Signed-off-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Martin Blumenstingl 提交于
Treat "block_count" from struct f2fs_super_block as 64-bit little endian value in sanity_check_raw_super() because struct f2fs_super_block declares "block_count" as "__le64". This fixes a bug where the superblock validation fails on big endian devices with the following error: F2FS-fs (sda1): Wrong segment_count / block_count (61439 > 0) F2FS-fs (sda1): Can't find valid F2FS filesystem in 1th superblock F2FS-fs (sda1): Wrong segment_count / block_count (61439 > 0) F2FS-fs (sda1): Can't find valid F2FS filesystem in 2th superblock As result of this the partition cannot be mounted. With this patch applied the superblock validation works fine and the partition can be mounted again: F2FS-fs (sda1): Mounted with checkpoint version = 7c84 My little endian x86-64 hardware was able to mount the partition without this fix. To confirm that mounting f2fs filesystems works on big endian machines again I tested this on a 32-bit MIPS big endian (lantiq) device. Fixes: 0cfe75c5 ("f2fs: enhance sanity_check_raw_super() to avoid potential overflows") Cc: stable@vger.kernel.org Signed-off-by: NMartin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
This fixes missing unlock call. Cc: <stable@vger.kernel.org> Reviewed-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
If user change inode's i_flags via ioctl, let's add it into global dirty list, so that checkpoint can guarantee its persistence before fsync, it can make checkpoint keeping strong consistency. Signed-off-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
The union in struct extent_node wass only to indicate below fields struct rb_node rb_node; union { struct { unsigned int fofs; unsigned int len; ... ... can be parsed as fields in struct rb_entry, but they were never be used explicitly before, so let's remove them for cleanup. Signed-off-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Qiuyang Sun 提交于
Should use lstart (logical start address) instead of start (in dev) here. This fixes a bug in multi-device scenarios. Signed-off-by: NQiuyang Sun <sunqiuyang@huawei.com> Reviewed-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Sahitya Tummala 提交于
When there is a failure in f2fs_fill_super() after/during the recovery of fsync'd nodes, it frees the current sbi and retries again. This time the mount is successful, but the files that got recovered before retry, still holds the extent tree, whose extent nodes list is corrupted since sbi and sbi->extent_list is freed up. The list_del corruption issue is observed when the file system is getting unmounted and when those recoverd files extent node is being freed up in the below context. list_del corruption. prev->next should be fffffff1e1ef5480, but was (null) <...> kernel BUG at kernel/msm-4.14/lib/list_debug.c:53! lr : __list_del_entry_valid+0x94/0xb4 pc : __list_del_entry_valid+0x94/0xb4 <...> Call trace: __list_del_entry_valid+0x94/0xb4 __release_extent_node+0xb0/0x114 __free_extent_tree+0x58/0x7c f2fs_shrink_extent_tree+0xdc/0x3b0 f2fs_leave_shrinker+0x28/0x7c f2fs_put_super+0xfc/0x1e0 generic_shutdown_super+0x70/0xf4 kill_block_super+0x2c/0x5c kill_f2fs_super+0x44/0x50 deactivate_locked_super+0x60/0x8c deactivate_super+0x68/0x74 cleanup_mnt+0x40/0x78 __cleanup_mnt+0x1c/0x28 task_work_run+0x48/0xd0 do_notify_resume+0x678/0xe98 work_pending+0x8/0x14 Fix this by not creating extents for those recovered files if shrinker is not registered yet. Once mount is successful and shrinker is registered, those files can have extents again. Signed-off-by: NSahitya Tummala <stummala@codeaurora.org> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Chao Yu 提交于
This patch cleans up checkpoint flow a bit: - remove unneeded circulation of flushing meta pages. - don't flush nat_bits pages in prior to other checkpoint pages. - add bug_on to check remained meta pages after flushing. Signed-off-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
Sometimes, I could observe # of issuing_discard to be 1 which blocks background jobs due to is_idle()=false. The only way to get out of it was to trigger gc_urgent. This patch avoids that by checking any candidates as done in the list. Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
Let's use "queued" instead of "issuing". Reviewed-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Jaegeuk Kim 提交于
One report says memalloc failure during mount. (unwind_backtrace) from [<c010cd4c>] (show_stack+0x10/0x14) (show_stack) from [<c049c6b8>] (dump_stack+0x8c/0xa0) (dump_stack) from [<c024fcf0>] (warn_alloc+0xc4/0x160) (warn_alloc) from [<c0250218>] (__alloc_pages_nodemask+0x3f4/0x10d0) (__alloc_pages_nodemask) from [<c0270450>] (kmalloc_order_trace+0x2c/0x120) (kmalloc_order_trace) from [<c03fa748>] (build_node_manager+0x35c/0x688) (build_node_manager) from [<c03de494>] (f2fs_fill_super+0xf0c/0x16cc) (f2fs_fill_super) from [<c02a5864>] (mount_bdev+0x15c/0x188) (mount_bdev) from [<c03da624>] (f2fs_mount+0x18/0x20) (f2fs_mount) from [<c02a68b8>] (mount_fs+0x158/0x19c) (mount_fs) from [<c02c3c9c>] (vfs_kern_mount+0x78/0x134) (vfs_kern_mount) from [<c02c76ac>] (do_mount+0x474/0xca4) (do_mount) from [<c02c8264>] (SyS_mount+0x94/0xbc) (SyS_mount) from [<c0108180>] (ret_fast_syscall+0x0/0x48) Reviewed-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
由 Yunlong Song 提交于
Commit 089842de ("f2fs: remove codes of unused wio_mutex") removes codes of unused wio_mutex, but missing the comment, so delete it. Signed-off-by: NYunlong Song <yunlong.song@huawei.com> Reviewed-by: NChao Yu <yuchao0@huawei.com> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 23 12月, 2018 1 次提交
-
-
由 Christian Brauner 提交于
This reverts commit 55956b59. commit 55956b59 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: https://github.com/systemd/systemd/pull/9483Signed-off-by: NChristian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 12月, 2018 3 次提交
-
-
由 Omar Sandoval 提交于
At mount time, we allocate m_rsum_cache with the number of realtime bitmap blocks. However, xfs_growfs_rt() can increase the number of realtime bitmap blocks. Using the cache after this happens may access out of the bounds of the cache. Fix it by reallocating the cache in this case. Fixes: 355e3532 ("xfs: cache minimum realtime summary level") Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Dan Williams 提交于
get_unlocked_entry() uses an exclusive wait because it is guaranteed to eventually obtain the lock and follow on with an unlock+wakeup cycle. The wait_entry_unlocked() path does not have the same guarantee. Rather than open-code an extra wakeup, just switch to a non-exclusive wait. Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Eric Sandeen 提交于
iomap_is_partially_uptodate() is intended to check wither blocks within the selected range of a not-uptodate page are uptodate; if the range we care about is up to date, it's an optimization. However, the iomap implementation continues to check all blocks up to from+count, which is beyond the page, and can even be well beyond the iop->uptodate bitmap. I think the worst that will happen is that we may eventually find a zero bit and return "not partially uptodate" when it would have otherwise returned true, and skip the optimization. Still, it's clearly an invalid memory access that must be fixed. So: fix this by limiting the search to within the page as is done in the non-iomap variant, block_is_partially_uptodate(). Zorro noticed thiswhen KASAN went off for 512 byte blocks on a 64k page system: BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0 Read of size 8 at addr ffff800120c3a318 by task fsstress/22337 Reported-by: NZorro Lang <zlang@redhat.com> Signed-off-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NEric Sandeen <sandeen@sandeen.net> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 20 12月, 2018 3 次提交
-
-
由 Dave Chinner 提交于
This reverts commit 61c6de66. The reverted commit added page reference counting to iomap page structures that are used to track block size < page size state. This was supposed to align the code with page migration page accounting assumptions, but what it has done instead is break XFS filesystems. Every fstests run I've done on sub-page block size XFS filesystems has since picking up this commit 2 days ago has failed with bad page state errors such as: # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038" .... SECTION -- xfs FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+ MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc MOUNT_OPTIONS -- /dev/sdc /mnt/scratch generic/038 454s ... run fstests generic/038 at 2018-12-20 18:43:05 XFS (sdc): Unmounting Filesystem XFS (sdc): Mounting V5 Filesystem XFS (sdc): Ending clean mount BUG: Bad page state in process kswapd0 pfn:3a7fa page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1 flags: 0xfffffc0000000() raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360 raw: 0000000000000001 0000000000000000 00000000ffffffff page dumped because: non-NULL mapping CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 Call Trace: dump_stack+0x67/0x90 bad_page.cold.116+0x8a/0xbd free_pcppages_bulk+0x4bf/0x6a0 free_unref_page_list+0x10f/0x1f0 shrink_page_list+0x49d/0xf50 shrink_inactive_list+0x19d/0x3b0 shrink_node_memcg.constprop.77+0x398/0x690 ? shrink_slab.constprop.81+0x278/0x3f0 shrink_node+0x7a/0x2f0 kswapd+0x34b/0x6d0 ? node_reclaim+0x240/0x240 kthread+0x11f/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x24/0x30 Disabling lock debugging due to kernel taint .... The failures are from anyway that frees pages and empties the per-cpu page magazines, so it's not a predictable failure or an easy to debug failure. generic/038 is a reliable reproducer of this problem - it has a 9 in 10 failure rate on one of my test machines. Failure on other machines have been at random points in fstests runs but every run has ended up tripping this problem. Hence generic/038 was used to bisect the failure because it was the most reliable failure. It is too close to the 4.20 release (not to mention holidays) to try to diagnose, fix and test the underlying cause of the problem, so reverting the commit is the only option we have right now. The revert has been tested against a current tot 4.20-rc7+ kernel across multiple machines running sub-page block size XFs filesystems and none of the bad page state failures have been seen. Signed-off-by: NDave Chinner <dchinner@redhat.com> Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Darrick J. Wong 提交于
Use __print_symbolic to print the scrub type in ftrace output. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com>
-
由 Darrick J. Wong 提交于
Use __print_symbolic to print the btree type in ftrace output. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com>
-