- 10 April 2017: 21 commits
-
-
Committed by Jan Kara
When a userspace task processing fanotify permission events screws up and does not respond, fsnotify_mark_srcu is held indefinitely, which causes further hangs in the whole notification subsystem. Although we cannot easily solve the problem of operations blocked waiting for a response from userspace, we can at least somewhat localize the damage by dropping the SRCU lock before waiting for the userspace response and reacquiring it when userspace responds.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Pass fsnotify_iter_info into the ->handle_event() handler so that it can release and reacquire the SRCU lock via the fsnotify_prepare_user_wait() and fsnotify_finish_user_wait() functions. These functions also make sure the current marks are appropriately pinned so that the iteration protected by SRCU in fsnotify() stays safe.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
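A rough sketch of the wait pattern these helpers enable (the two function names come from the commit message; the handler shape, the fanotify_perm_event type, and the error policy here are assumptions for illustration, not the actual fanotify code):

    /* Hedged sketch: drop fsnotify_mark_srcu around a userspace wait. */
    static int wait_for_userspace_response(struct fsnotify_group *group,
                                           struct fanotify_perm_event *event,
                                           struct fsnotify_iter_info *iter_info)
    {
            int ret;

            /* Pin the marks in the object list, then drop fsnotify_mark_srcu. */
            if (!fsnotify_prepare_user_wait(iter_info))
                    return -ENOMEM;         /* assumed failure policy */

            /* Userspace may now take arbitrarily long without blocking the
             * whole notification subsystem. */
            ret = wait_event_killable(group->fanotify_data.access_waitq,
                                      event->response != 0);

            /* Reacquire fsnotify_mark_srcu and unpin the marks. */
            fsnotify_finish_user_wait(iter_info);
            return ret;
    }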
-
Committed by Jan Kara
fanotify wants to drop the fsnotify_mark_srcu lock when waiting for a response from userspace so that the whole notification subsystem is not blocked during that time. This patch provides a framework for safely getting a mark reference for a mark found in the object list, which pins the mark in that list. We can then drop fsnotify_mark_srcu, wait for the userspace response, and then safely continue iteration of the object list once we reacquire fsnotify_mark_srcu.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Currently we queue all marks for destruction on group shutdown and then destroy them from fsnotify_destroy_group() instead of from a worker thread, which is the usual path. However, the worker can already be processing some list of marks to destroy, so this does not make 100% sure that all marks are really destroyed by the time the group is shut down. This isn't a big problem, as each mark holds a group reference and thus the group stays partially alive until all marks are really freed, but there's no point in complicating our lives - just wait for the delayed work to be finished instead.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Instead of removing a mark from the object list in fsnotify_detach_mark(), remove the mark when the last reference to it is dropped. This will allow fanotify to wait for a userspace response to an event without having to hold onto fsnotify_mark_srcu. To avoid pinning inodes by an elevated refcount (and thus e.g. delaying file deletion) while someone holds a mark reference, we detach the connector from the object also from fsnotify_destroy_marks() and not only after removing the last mark from the list, as was done until now.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Currently we queue a mark into the list of marks for destruction in __fsnotify_free_mark() and keep the last mark reference dangling. After the worker waits for an SRCU period, it drops the last reference to the mark, which frees it. This scheme has the disadvantage that if we hold a reference to a mark and drop and reacquire the SRCU lock, the mark can get freed immediately, which is slightly inconvenient, and we will need to avoid this in the future. Move to a scheme where queueing of the mark into the list of marks for destruction happens when the last reference to the mark is dropped. Also drop the reference to the mark held by the group list already when the mark is removed from that list, instead of dropping it only from the destruction worker.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
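As a rough illustration of the new scheme (a simplified sketch; the destroy_list, destroy_lock, reaper_work, and FSNOTIFY_REAPER_DELAY names are assumptions based on the surrounding fsnotify code and need not match the patch exactly):

    /* Hedged sketch of the put side after this change: queueing for
     * destruction happens only when the last reference is dropped. */
    void fsnotify_put_mark(struct fsnotify_mark *mark)
    {
            if (!atomic_dec_and_test(&mark->refcnt))
                    return;

            /* Last reference gone: defer freeing until an SRCU grace
             * period has passed so fsnotify() iterators can finish. */
            spin_lock(&destroy_lock);
            list_add(&mark->g_list, &destroy_list);
            spin_unlock(&destroy_lock);
            queue_delayed_work(system_unbound_wq, &reaper_work,
                               FSNOTIFY_REAPER_DELAY);
    }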
-
Committed by Jan Kara
Dropping a mark reference can result in the mark being freed. Although that should not happen in inotify_remove_from_idr() since the caller should hold another reference, just don't risk a lockup right after the WARN_ON unnecessarily. Also fold do_inotify_remove_from_idr() into its single callsite, as that function really is just two lines of real code.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Currently we free the fsnotify_mark_connector structure only when the inode / vfsmount is getting freed. This can however impose noticeable memory overhead when marks get attached to inodes only temporarily. So free the connector structure once the last mark is detached from the object. Since the notification infrastructure can be working with the connector under the protection of fsnotify_mark_srcu, we have to be careful and free the fsnotify_mark_connector only after an SRCU period passes.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
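One way to realize "free only after an SRCU period passes" is call_srcu(); the patch may instead batch connectors through a worker, and the embedded rcu member below is an illustrative assumption, not the actual structure layout:

    /* Hedged sketch: the connector may still be dereferenced by readers
     * inside fsnotify_mark_srcu, so defer the actual kfree(). */
    static void fsnotify_connector_free_cb(struct rcu_head *head)
    {
            struct fsnotify_mark_connector *conn =
                    container_of(head, struct fsnotify_mark_connector, rcu);

            kfree(conn);
    }

    static void fsnotify_put_connector(struct fsnotify_mark_connector *conn)
    {
            /* Called once the last mark was detached from the object. */
            call_srcu(&fsnotify_mark_srcu, &conn->rcu,
                      fsnotify_connector_free_cb);
    }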
-
Committed by Jan Kara
So far the list of marks attached to an object (inode / vfsmount) was protected by i_lock or mnt_root->d_lock. This dictates that the list must be empty before the object can be destroyed, although the list is now anchored in the fsnotify_mark_connector structure. Protect the list by a spinlock in the fsnotify_mark_connector structure to decouple the lifetime of the list of marks from the lifetime of the object. This also simplifies the code quite a bit, since we no longer have to differentiate between inode and vfsmount lists in quite a few places.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
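After this change the connector roughly looks like the following (a simplified sketch; exact member names and any extra bookkeeping fields are assumptions):

    /* Hedged sketch of the connector once it owns the lock and list. */
    struct fsnotify_mark_connector {
            spinlock_t lock;        /* protects 'list' below */
            union {                 /* object the marks are attached to */
                    struct inode *inode;
                    struct vfsmount *mnt;
            };
            struct hlist_head list; /* all marks watching the object */
    };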
-
Committed by Jan Kara
After removing all the indirection, it is clear that the hlist_del_init_rcu(&mark->obj_list); in fsnotify_destroy_marks() is not needed, as the mark gets removed from the list shortly afterwards in fsnotify_destroy_mark() -> fsnotify_detach_mark() -> fsnotify_detach_from_object(). There is also no problem with the mark being visible on the object list while we call fsnotify_destroy_mark(), as parallel destruction of marks from several places is properly handled (as mentioned in the comment in fsnotify_destroy_marks()). So just remove the list removal and also the stale comment.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
We lock the object list lock in fsnotify_detach_from_object() twice - once to detach the mark and a second time to recalculate the mask. That is unnecessary, and later it will become problematic as we will free the connector as soon as there is no mark in it. So move recalculation of the fsnotify mask into the same critical section that detaches the mark. This also removes recalculation of child dentry flags from fsnotify_detach_from_object(). That is however fine: those flags will get recalculated once some event happens on a child.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
fsnotify_detach_mark() calls fsnotify_destroy_inode_mark() or fsnotify_destroy_vfsmount_mark() to remove the mark from the object list. These two functions are however very similar and differ only in the lock they use to protect the object list of marks. Simplify the code by removing the indirection and removing the mark from the object list in a common function.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Instead of passing the spinlock into fsnotify_destroy_marks(), determine it directly in that function from the connector type. This will reduce code churn when changing the lock protecting the list of marks.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Move locking of the mark list into fsnotify_find_mark(). This reduces code churn in the following patch changing the lock protecting the list.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Move locking of the locks protecting a list of marks into fsnotify_recalc_mask(). This reduces code churn in the following patch, which changes the lock protecting the list of marks.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Move fsnotify_destroy_marks() later in fs/notify/mark.c. It will need some functions that are declared after its current location. No functional change.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Adding a notification mark to the object list is currently done through the fsnotify_add_{inode|vfsmount}_mark() helpers from fsnotify_add_mark_locked(), which call fsnotify_add_mark_list(). Remove this unnecessary indirection to simplify the code. Pushing all the locking into fsnotify_add_mark_list() also allows us to allocate the connector structure with GFP_KERNEL.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Currently the inode reference is held by fsnotify marks. Change the rules so that the inode reference is held by the fsnotify_mark_connector structure whenever the list is non-empty. This simplifies the code and is more logical.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Move the pointer to the inode / vfsmount from the mark itself to the fsnotify_mark_connector structure. This is another step on the path towards decoupling inode / vfsmount lifetime from notification mark lifetime.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Currently notification marks are attached to an object (inode or vfsmount) by an hlist_head in the object. The list is also protected by a spinlock in the object. So while there is any mark attached to the list of marks, the object must be pinned in memory (and thus e.g. the last iput() deleting the inode cannot happen). Also, for list iteration in fsnotify() to work, we must hold the fsnotify_mark_srcu lock so that the mark itself and mark->obj_list.next cannot get freed. Thus we are required to wait for a response to fanotify events from the userspace process with the fsnotify_mark_srcu lock held. That causes issues when the userspace process is buggy and does not reply to some event - basically the whole notification subsystem eventually gets stuck.

So to be able to drop the fsnotify_mark_srcu lock while waiting for a response, we have to pin the mark in memory and make sure it stays in the object list (as removing the mark waiting for a response could lead to lost notification events for groups later in the list). However, we don't want inode reclaim to block on such a mark, as that would lead to the system just locking up elsewhere.

This commit is the first in a series that paves the way towards solving these conflicting lifetime needs. Instead of anchoring the list of marks directly in the object, we anchor it in a dedicated structure (fsnotify_mark_connector) and just point to that structure from the object. The following commits will also add a spinlock protecting the list and an object pointer to the structure.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Add a comment that the lifetime of a notification mark is protected by SRCU and remove a comment about clearing of marks attached to the inode. It is stale, and a more up-to-date version is at fsnotify_destroy_marks(), which is the function handling this case.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
- 03 April 2017: 3 commits
-
-
Committed by Jan Kara
Move recalculation of the inode / vfsmount notification mask under the group->mark_mutex of the mark which was modified. These are the only places where mask recalculation happens without the mark being protected from detaching from the inode / vfsmount, which would cause issues with the following patches.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
Printing inode pointers in warnings has dubious value, and with future changes we won't be able to easily get them without either locking or a chance that we oops along the way. So just remove inode pointers from the warning messages.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Jan Kara
show_fdinfo() iterates the group's list of marks. All marks found there are guaranteed to be alive, and they stay so until we release group->mark_mutex. So remove the unnecessary tests whether a mark is alive.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
- 10 March 2017: 9 commits
-
-
Committed by David Hildenbrand
It's a void function, so there is no return value.

Link: http://lkml.kernel.org/r/20170309150817.7510-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by OGAWA Hirofumi
Recently the fallocate patch was merged, and it uses MSDOS_I(inode)->mmu_private at fat_evict_inode(). However, fat_inode/fsinfo_inode, which were introduced in the past, didn't initialize MSDOS_I(inode) properly. In combination, this became the cause of accessing a random entry in the FAT area.

Link: http://lkml.kernel.org/r/87pohrj4i8.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Moreno Bartalucci <moreno.bartalucci@tecnorama.it>
Tested-by: Moreno Bartalucci <moreno.bartalucci@tecnorama.it>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrea Arcangeli
userfaultfd_remove() has to be executed before zapping the pagetables, or UFFDIO_COPY could keep filling pages after zap_page_range returned, which would result in non-zero data after a MADV_DONTNEED. However, userfaultfd_remove() may have to release the mmap_sem. This was handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a potentially stale vma (the very vma passed to zap_page_range(vma, ...)). The fix consists of revalidating the vma in case userfaultfd_remove() had to release the mmap_sem. This also optimizes away an unnecessary down_read/up_read in the MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered. It all remains zero runtime cost in case CONFIG_USERFAULTFD=n, as userfaultfd_remove() will be defined as "true" at build time.

Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
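A sketch of the revalidation pattern described above (helper names follow the madvise code of that era, but treat the exact shape as an illustration rather than the literal patch):

    /* Hedged sketch of the MADV_DONTNEED path after this fix. */
    static long madvise_dontneed(struct vm_area_struct *vma,
                                 struct vm_area_struct **prev,
                                 unsigned long start, unsigned long end)
    {
            *prev = vma;
            if (!can_madv_dontneed_vma(vma))
                    return -EINVAL;

            if (!userfaultfd_remove(vma, start, end)) {
                    /* userfaultfd_remove() dropped mmap_sem: 'vma' may be
                     * stale, so it must be looked up again before zapping. */
                    *prev = NULL;   /* tell the caller mmap_sem was dropped */
                    down_read(&current->mm->mmap_sem);
                    vma = find_vma(current->mm, start);
                    if (!vma || start < vma->vm_start)
                            return -ENOMEM;
                    if (!can_madv_dontneed_vma(vma))
                            return -EINVAL;
                    if (end > vma->vm_end)
                            end = vma->vm_end;
            }
            zap_page_range(vma, start, end - start);
            return 0;
    }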
-
Committed by Mike Rapoport
We have a memory leak in the ->new ctx: if the uffd of the parent is closed before the fork event is read, nothing frees the new context.

Link: http://lkml.kernel.org/r/20170302173738.18994-2-aarcange@redhat.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrea Arcangeli
Don't stop running dup_fctx() even if userfaultfd_event_wait_completion fails, as it has to run userfaultfd_ctx_put on all ctxs to pair against the userfaultfd_ctx_get that was run on all fctx->orig in dup_userfaultfd.

Link: http://lkml.kernel.org/r/20170224181957.19736-4-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrea Arcangeli
Similar to the handle_userfault() case, also make sure to never attempt to send any event past the PF_EXITING point of no return. This is purely a robustness check.

Link: http://lkml.kernel.org/r/20170224181957.19736-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrea Arcangeli
Patch series "userfaultfd non-cooperative further update for 4.11 merge window".

Unfortunately I noticed one relevant bug in userfaultfd_exit while doing more testing. I had been testing before, and this was also tested by the kbuild bot and exercised by the selftest, but this bug never reproduced before.

I dropped userfaultfd_exit as a result. I dropped it because of the implementation difficulty of receiving signals in __mmput, and because I think -ENOSPC as a result from the background UFFDIO_COPY should be enough already.

Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit wasn't exercised by the selftest, and when I tried to exercise it, after moving it to a more correct place in __mmput where it would make more sense and where the vma list is stable, it resulted in event_wait_completion in D state. So then I added the second patch to be sure that even if we call userfaultfd_event_wait_completion too late during task exit(), we won't risk generating tasks in D state. The same check exists in handle_userfault() for the same reason, except it makes a difference there, while here it is just a robustness check and it's run under WARN_ON_ONCE.

While looking at the userfaultfd_event_wait_completion() function I also looked back at its callers, and I think it's not OK to stop executing dup_fctx on the fcs list, because we rely on userfaultfd_event_wait_completion to execute userfaultfd_ctx_put(fctx->orig), which is paired against userfaultfd_ctx_get(fctx->orig) in dup_userfault just before list_add(fcs). This change only takes care of fctx->orig, but this area also needs further review looking for similar problems in fctx->new.

The only patch that is urgent is the first, because it's a use after free during an SMP race condition that affects all processes if CONFIG_USERFAULTFD=y. It is very hard to reproduce though, and probably impossible without SLUB poisoning enabled.

This patch (of 3):

I once reproduced this oops with the userfaultfd selftest; it's not easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[<ffffffff81451299>] [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
      do_exit+0x297/0xd10
      SyS_exit+0x17/0x20
      tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
    RSP <ffff8801f833fe80>
    ---[ end trace 9fecd6dcb442846a ]---

In the debugger I located the "mm" pointer in the stack, and walking mm->mmap->vm_next through to the end shows the vma->vm_next list is fully consistent and is a null-terminated list, as expected. So this has to be an SMP race condition where userfaultfd_exit was running while the vma list was being modified by another CPU. When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

The reason is that it's not running in __mmput but while there are still other threads running, and it's not holding the mmap_sem (it can't, as it has to wait for the event to be received by the manager). So this is a use after free that was happening for all processes.

One more implementation problem aside from the race condition: userfaultfd_exit really has to check a flag in mm->flags before walking the vma, or it's going to slow down the exit() path for regular tasks.

One more implementation problem: at that point signals can't be delivered, so it would also create a task in D state if the manager doesn't read the event.

The major design issue: it overall looks superfluous, as the manager can check for -ENOSPC in the background transfer:

    if (mmget_not_zero(ctx->mm)) {
            [..]
    } else {
            return -ENOSPC;
    }

It's safer to roll it back and re-introduce it later if at all.

[rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrea Arcangeli
__do_fault assumes vmf->page has been initialized and is valid if VM_FAULT_NOPAGE is not returned by vma->vm_ops->fault(vma, vmf). handle_userfault() in turn should return VM_FAULT_NOPAGE if it doesn't return VM_FAULT_SIGBUS or VM_FAULT_RETRY (the other two possibilities). This VM_FAULT_NOPAGE case is only invoked when signals are pending, and it didn't matter for anonymous memory before. It only started to matter since shmem was introduced. hugetlbfs also takes a different path and doesn't exercise __do_fault.

Link: http://lkml.kernel.org/r/20170228154201.GH5816@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
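In caller terms, the contract this fixes looks roughly like the following (a paraphrase of __do_fault()'s expectation, not the literal mm code):

    /* Hedged sketch: vmf->page is only trusted when none of the "no page"
     * codes came back, which is why handle_userfault() must return
     * VM_FAULT_NOPAGE rather than 0 when it bails out on a pending signal. */
    static int do_fault_contract(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            int ret = vma->vm_ops->fault(vma, vmf);

            if (ret & (VM_FAULT_SIGBUS | VM_FAULT_RETRY | VM_FAULT_NOPAGE))
                    return ret;     /* no page was installed; don't touch it */

            /* From here on vmf->page must be valid and initialized. */
            lock_page(vmf->page);
            return ret;
    }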
-
Committed by Kirill A. Shutemov
Convert all non-architecture-specific code to 5-level paging. It's mostly mechanical: add handling of one more page table level in places where we deal with pud_t.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
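The mechanical pattern is the same everywhere: a p4d level is threaded between pgd and pud, so a generic walk now looks roughly like this (a schematic example of the conversion, not a specific function from the patch):

    /* Hedged sketch of a page-table walk after the 5-level conversion:
     * the new p4d level sits between pgd and pud. */
    static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long addr)
    {
            pgd_t *pgd = pgd_offset(mm, addr);
            p4d_t *p4d;
            pud_t *pud;
            pmd_t *pmd;

            if (pgd_none_or_clear_bad(pgd))
                    return NULL;
            p4d = p4d_offset(pgd, addr);    /* the new level */
            if (p4d_none_or_clear_bad(p4d))
                    return NULL;
            pud = pud_offset(p4d, addr);    /* used to take a pgd_t * */
            if (pud_none_or_clear_bad(pud))
                    return NULL;
            pmd = pmd_offset(pud, addr);
            if (pmd_none_or_clear_bad(pmd))
                    return NULL;
            return pte_offset_map(pmd, addr);
    }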
-
- 09 March 2017: 3 commits
-
-
Committed by Linus Torvalds
This removes the extra include header file that was added in commit e58bc927 ("Pull overlayfs updates from Miklos Szeredi") now that it is no longer needed. There are probably other such includes that got added during the scheduler header splitup series, but this is the one that annoyed me personally and that I know about.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Christoph Hellwig
When a reflink operation causes the bmap code to allocate a btree block, we're currently doing single-AG allocations due to having ->firstblock set, and then we try any higher AG due to a little reflink quirk we put in when adding the reflink code. But given that we do not have a minleft reservation of any kind in this AG, we can still end up with no space in the same or a higher AG even if the file system has enough free space. To fix this, use an XFS_ALLOCTYPE_FIRST_AG allocation in this fallback path instead.

[And yes, we need to redo this properly instead of piling hacks over hacks. I'm working on that, but it's not going to be a small series. In the meantime this fixes the customer-reported issue.]

Also add a warning for failing allocations to make it easier to debug.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Brian Foster
Commit fa7f138a ("xfs: clear delalloc and cache on buffered write failure") fixed one regression in the iomap error handling code and exposed another. The fundamental problem is that if a buffered write is a rewrite of preexisting delalloc blocks and the write fails, the failure handling code can punch out preexisting blocks with valid file data.

This was reproduced directly by sub-block writes in the LTP kernel/syscalls/write/write03 test. A first 100 byte write allocates a single block in a file. A subsequent 100 byte write fails and punches out the block, including the data successfully written by the previous write.

To address this problem, update the ->iomap_begin() handler to distinguish newly allocated delalloc blocks from preexisting delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the ->iomap_end() handler to decide when a failed or short write should punch out delalloc blocks.

This introduces the subtle requirement that ->iomap_begin() should never combine newly allocated delalloc blocks with existing blocks in the resulting iomap descriptor. This can occur when a new delalloc reservation merges with a neighboring extent that is part of the current write, for example. Therefore, drop the post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and just return the record inserted into the fork. This ensures only new blocks are returned and thus that preexisting delalloc blocks are always handled as "found" blocks and not punched out on a failed rewrite.

Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
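The resulting ->iomap_end() decision reduces to something like the following (a condensed sketch of the logic the commit describes; the helper name exists in xfs but its exact signature and the placement of the IOMAP_F_NEW check are assumptions here):

    /* Hedged sketch: only punch out delalloc blocks that this write itself
     * reserved (IOMAP_F_NEW); preexisting delalloc blocks may cover valid
     * data from an earlier write. */
    static int xfs_file_iomap_end_sketch(struct inode *inode, loff_t offset,
                                         loff_t length, ssize_t written,
                                         unsigned flags, struct iomap *iomap)
    {
            if ((flags & IOMAP_WRITE) &&
                iomap->type == IOMAP_DELALLOC &&
                (iomap->flags & IOMAP_F_NEW))
                    return xfs_file_iomap_end_delalloc(XFS_I(inode), offset,
                                                       length, written);
            return 0;
    }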
-
- 08 March 2017: 4 commits
-
-
Committed by Darrick J. Wong
The sole remaining caller of kmem_zalloc_greedy is bulkstat, which uses it to grab 1-4 pages for staging of inobt records. The infinite loop in the greedy allocation function is causing hangs[1] in generic/269, so just get rid of the greedy allocator in favor of kmem_zalloc_large. This makes bulkstat somewhat more likely to ENOMEM if there are really no pages to spare, but eliminates a source of hangs.

[1] http://lkml.kernel.org/r/20170301044634.rgidgdqqiiwsmfpj%40XZHOUW.usersys.redhat.com

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v2: remove single-page fallback
-
Committed by Chandan Rajendra
When the block size is larger than the inode cluster size, the call to XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs would have set xfs_sb->sb_inoalignmt to 0. Hence in xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to -1 instead of 0. However, xfs_mount->m_sinoalign would get correctly initialized to 0, because for every positive value of xfs_mount->m_dalign the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to false.

Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having -1 as the value, because the blks_per_cluster variable would have the value 1, and hence we would never need to use xfs_mount->m_inoalign_mask to compute the inode chunk's agbno and offset within the chunk.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
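The failing arithmetic can be demonstrated in isolation (a userspace toy, not xfs code; the 64k block size and 8k inode cluster are illustrative numbers):

    #include <stdio.h>

    /* Hedged illustration of the arithmetic the commit describes: the
     * byte-to-filesystem-block conversion (a right shift by the block
     * log) truncates to 0 when the inode cluster is smaller than one
     * block, so "alignment - 1" underflows to -1. */
    int main(void)
    {
            unsigned int blocklog = 16;              /* 64k blocks */
            unsigned int inode_cluster_size = 8192;  /* 8k inode cluster */

            unsigned int blks_per_cluster = inode_cluster_size >> blocklog; /* 0 */
            int inoalign_mask = (int)blks_per_cluster - 1;                  /* -1 */

            printf("blks_per_cluster=%u inoalign_mask=%d\n",
                   blks_per_cluster, inoalign_mask);
            return 0;
    }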
-
Committed by Christoph Hellwig
There are two different cases of buffered I/O errors:

 - first, we can have an already shutdown fs. In that case we should skip any on-disk operations and just clean up the append transaction if present and destroy the ioend

 - a real I/O error. In that case we should clean up any lingering COW blocks. This gets skipped in the current code and is fixed by this patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Committed by Christoph Hellwig
We only want to reclaim preallocations from our periodic work item. Currently this is achieved by looking for a dirty inode, but that check is rather fragile. Instead, add a flag to xfs_reflink_cancel_cow_* so that the caller can ask for cancelling only the unwritten extents in the COW fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-