- 08 6月, 2018 18 次提交
-
-
由 Alexey Dobriyan 提交于
struct kstat is thread local. Link: http://lkml.kernel.org/r/20180423213626.GB9043@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Code can be sonsolidated if a dummy region of 0 length is used in normal case of \0-separated command line: 1) [arg_start, arg_end) + [dummy len=0] 2) [arg_start, arg_end) + [env_start, env_end) Link: http://lkml.kernel.org/r/20180221193335.GB28678@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
"rv" variable is used both as a counter of bytes transferred and an error value holder but it can be reduced solely to error values if original start of userspace buffer is stashed and used at the very end. [akpm@linux-foundation.org: simplify cleanup code] Link: http://lkml.kernel.org/r/20180221193009.GA28678@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
"final" variable is OK but we can get away with less lines. Link: http://lkml.kernel.org/r/20180221192751.GC28548@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
access_remote_vm() doesn't return negative errors, it returns number of bytes read/written (0 if error occurs). This allows to delete some comparisons which never trigger. Reuse "nr_read" variable while I'm at it. Link: http://lkml.kernel.org/r/20180221192605.GB28548@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Rapoport 提交于
If a process monitored with userfaultfd changes it's memory mappings or forks() at the same time as uffd monitor fills the process memory with UFFDIO_COPY, the actual creation of page table entries and copying of the data in mcopy_atomic may happen either before of after the memory mapping modifications and there is no way for the uffd monitor to maintain consistent view of the process memory layout. For instance, let's consider fork() running in parallel with userfaultfd_copy(): process | uffd monitor ---------------------------------+------------------------------ fork() | userfaultfd_copy() ... | ... dup_mmap() | down_read(mmap_sem) down_write(mmap_sem) | /* create PTEs, copy data */ dup_uffd() | up_read(mmap_sem) copy_page_range() | up_write(mmap_sem) | dup_uffd_complete() | /* notify monitor */ | If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will be present by the time copy_page_range() is called and they will appear in the child's memory mappings. However, if the fork() is the first to take the mmap_sem, the new pages won't be mapped in the child's address space. If the pages are not present and child tries to access them, the monitor will get page fault notification and everything is fine. However, if the pages *are present*, the child can access them without uffd noticing. And if we copy them into child it'll see the wrong data. Since we are talking about background copy, we'd need to decide whether the pages should be copied or not regardless #PF notifications. Since userfaultfd monitor has no way to determine what was the order, let's disallow userfaultfd_copy in parallel with the non-cooperative events. In such case we return -EAGAIN and the uffd monitor can understand that userfaultfd_copy() clashed with a non-cooperative event and take an appropriate action. Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: NPavel Emelyanov <xemul@virtuozzo.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
Define a new PageTable bit in the page_type and use it to mark pages in use as page tables. This can be helpful when debugging crashdumps or analysing memory fragmentation. Add a KPF flag to report these pages to userspace and update page-types.c to interpret that flag. Note that only pages currently accounted as NR_PAGETABLES are tracked as PageTable; this does not include pgd/p4d/pud/pmd pages. Those will be the subject of a later patch. Link: http://lkml.kernel.org/r/20180518194519.3820-4-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
In commit ab676b7d ("pagemap: do not leak physical addresses to non-privileged userspace"), the /proc/PID/pagemap is restricted to be readable only by CAP_SYS_ADMIN to address some security issue. In commit 1c90308e ("pagemap: hide physical addresses from non-privileged users"), the restriction is relieved to make /proc/PID/pagemap readable, but hide the physical addresses for non-privileged users. But the swap entries are readable for non-privileged users too. This has some security issues. For example, for page under migrating, the swap entry has physical address information. So, in this patch, the swap entries are hided for non-privileged users too. Link: http://lkml.kernel.org/r/20180508012745.7238-1-ying.huang@intel.com Fixes: 1c90308e ("pagemap: hide physical addresses from non-privileged users") Signed-off-by: N"Huang, Ying" <ying.huang@intel.com> Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrei Vagin <avagin@openvz.org> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Daniel Colascione <dancol@google.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Kravetz 提交于
With the addition of memfd hugetlbfs support, we now have the situation where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only supported on tmpfs, so it made sense that the code resided in shmem.c. In the current code, memfd is only functional if TMPFS is defined. If HUGETLFS is defined and TMPFS is not defined, then memfd functionality will not be available for hugetlbfs. This does not cause BUGs, just a lack of potentially desired functionality. Code is restructured in the following way: - include/linux/memfd.h is a new file containing memfd specific definitions previously contained in shmem_fs.h. - mm/memfd.c is a new file containing memfd specific code previously contained in shmem.c. - memfd specific code is removed from shmem_fs.h and shmem.c. - A new config option MEMFD_CREATE is added that is defined if TMPFS or HUGETLBFS is defined. No functional changes are made to the code: restructuring only. Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marc-Andr Lureau <marcandre.lureau@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chengguang Xu 提交于
Currently when detecting invalid options in option parsing, some options(e.g. msize) just set errno and allow to continuously validate other options so that it can detect invalid options as much as possible and give proper error messages together. This patch applies same rule to option 'cache' and 'access' when detecting -EINVAL. Link: http://lkml.kernel.org/r/1525340676-34072-2-git-send-email-cgxu519@gmx.comSigned-off-by: NChengguang Xu <cgxu519@gmx.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t") vmf_error() is the newly introduce inline function in 4.18. Fix one checkpatch.pl warning by replacing BUG_ON() with WARN_ON() [akpm@linux-foundation.org: undo BUG_ON->WARN_ON change] Link: http://lkml.kernel.org/r/20180523153258.GA28451@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Salvatore Mesoraca 提交于
Avoid a VLA by using a real constant expression instead of a variable. The compiler should be able to optimize the original code and avoid using an actual VLA. Anyway this change is useful because it will avoid a false positive with -Wvla, it might also help the compiler generating better code. Link: http://lkml.kernel.org/r/1520970710-19732-1-git-send-email-s.mesoraca16@gmail.comSigned-off-by: NSalvatore Mesoraca <s.mesoraca16@gmail.com> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Guozhonghua 提交于
Correct the comments position of the structure ocfs2_dir_block_trailer. Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071C5FDE@H3CMLB12-EX.srv.huawei-3com.comSigned-off-by: Nguozhonghua <guozhonghua@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhen Lei 提交于
The warning is invalid because the parameter chunksize passed from ocfs2_info_freefrag_scan_chain-->ocfs2_info_update_ffg is guaranteed to be positive. So __ilog2_u32 cannot return -1. fs/ocfs2/ioctl.c: In function 'ocfs2_info_update_ffg': fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds] hist->fc_chunks[index]++; ^ fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds] Link: http://lkml.kernel.org/r/1524655799-12112-1-git-send-email-thunder.leizhen@huawei.comSigned-off-by: NZhen Lei <thunder.leizhen@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Larry Chen 提交于
ocfs2_inode_lock_tracker as a variant of ocfs2_inode_lock, is used to prevent deadlock due to recursive lock acquisition. But this function does not distinguish whether the requested level is EX or PR. If a RP lock has been attained, this function will immediately return success afterwards even an EX lock is requested. But actually the return value does not mean that the process got a EX lock, because ocfs2_inode_lock has not been called. When taking lock levels into account, we face some different situations: 1. no lock is held In this case, just lock the inode and return 0 2. We are holding a lock For this situation, things diverges into several cases wanted holding what to do ex ex see 2.1 below ex pr see 2.2 below pr ex see 2.1 below pr pr see 2.1 below 2.1 lock level that is been held is compatible with the wanted level, so no lock action will be tacken. 2.2 Otherwise, an upgrade is needed, but it is forbidden. Reason why upgrade within a process is forbidden is that lock upgrade may cause dead lock. The following illustrate how it happens. process 1 process 2 ocfs2_inode_lock_tracker(ex=0) <====== ocfs2_inode_lock_tracker(ex=1) ocfs2_inode_lock_tracker(ex=1) For the status quo of ocfs2, without this patch, neither a bug nor end-user impact will be caused because the wrong logic is avoided. But I'm afraid this generic interface, may be called by other developers in future and used in this situation. a process ocfs2_inode_lock_tracker(ex=0) ocfs2_inode_lock_tracker(ex=1) Link: http://lkml.kernel.org/r/20180510053230.17217-1-lchen@suse.comSigned-off-by: NLarry Chen <lchen@suse.com> Reviewed-by: NGang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jia Guo 提交于
ocfs2_extend_allocation() has been deleted, clean up its declaration. Also change the static function name from __ocfs2_extend_allocation() to ocfs2_extend_allocation() to be consistent with the corresponding trace events as well as comments for ocfs2_lock_allocators(). Link: http://lkml.kernel.org/r/09cf7125-6f12-e53e-20f5-e606b2c16b48@huawei.com Fixes: 964f14a0 ("ocfs2: clean up some dead code") Signed-off-by: NJia Guo <guojia12@huawei.com> Acked-by: NJoseph Qi <jiangqi903@gmail.com> Reviewed-by: NJun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f4220 ("mm: change return type to vm_fault_t") There was an existing bug inside dax_load_hole() if vm_insert_mixed had failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of VM_FAULT_OOM. With new vmf_insert_mixed() this issue is addressed. vm_insert_mixed_mkwrite has inefficiency when it returns an error value, driver has to convert it to vm_fault_t type. With new vmf_insert_mixed_mkwrite() this limitation will be addressed. Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 6月, 2018 1 次提交
-
-
由 Kees Cook 提交于
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This patch makes the changes for kmalloc()-family (and kvmalloc()-family) uses. It was done via automatic conversion with manual review for the "CHECKME" non-standard cases noted below, using the following Coccinelle script: // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * // sizeof *pkey_cache->table, GFP_KERNEL); @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; identifier VAR, ELEMENT; expression COUNT; @@ - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP) + alloc(struct_size(VAR, ELEMENT, COUNT), GFP) // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL); @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; identifier VAR, ELEMENT; expression COUNT; @@ - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP) + alloc(struct_size(VAR, ELEMENT, COUNT), GFP) // Same pattern, but can't trivially locate the trailing element name, // or variable name. @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; expression SOMETHING, COUNT, ELEMENT; @@ - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP) + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP) Signed-off-by: NKees Cook <keescook@chromium.org>
-
- 05 6月, 2018 1 次提交
-
-
由 David Howells 提交于
Sometimes an in-progress call will stop responding on the fileserver when the fileserver quietly cancels the call with an internally marked abort (RX_CALL_DEAD), without sending an ABORT to the client. This causes the client's call to eventually expire from lack of incoming packets directed its way, which currently leads to it being cancelled locally with ETIME. Note that it's not currently clear as to why this happens as it's really hard to reproduce. The rotation policy implement by kAFS, however, doesn't differentiate between ETIME meaning we didn't get any response from the server and ETIME meaning the call got cancelled mid-flow. The latter leads to an oops when fetching data as the rotation partially resets the afs_read descriptor, which can result in a cleared page pointer being dereferenced because that page has already been filled. Handle this by the following means: (1) Set a flag on a call when we receive a packet for it. (2) Store the highest packet serial number so far received for a call (bearing in mind this may wrap). (3) If, when the "not received anything recently" timeout expires on a call, we've received at least one packet for a call and the connection as a whole has received packets more recently than that call, then cancel the call locally with ECONNRESET rather than ETIME. This indicates that the call was definitely in progress on the server. (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME, don't try the next server, but rather abort the call. This avoids the oops as we don't try to reuse the afs_read struct. Rather, as-yet ungotten pages will be reread at a later data. Also: (5) Add an rxrpc tracepoint to log detection of the call being reset. Without this, I occasionally see an oops like the following: general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:_copy_to_iter+0x204/0x310 RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206 RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560 RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000 RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400 R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958 R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560 FS: 00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0 Call Trace: skb_copy_datagram_iter+0x14e/0x289 rxrpc_recvmsg_data.isra.0+0x6f3/0xf68 ? trace_buffer_unlock_commit_regs+0x4f/0x89 rxrpc_kernel_recv_data+0x149/0x421 afs_extract_data+0x1e0/0x798 ? afs_wait_for_call_to_complete+0xc9/0x52e afs_deliver_fs_fetch_data+0x33a/0x5ab afs_deliver_to_call+0x1ee/0x5e0 ? afs_wait_for_call_to_complete+0xc9/0x52e afs_wait_for_call_to_complete+0x12b/0x52e ? wake_up_q+0x54/0x54 afs_make_call+0x287/0x462 ? afs_fs_fetch_data+0x3e6/0x3ed ? rcu_read_lock_sched_held+0x5d/0x63 afs_fs_fetch_data+0x3e6/0x3ed afs_fetch_data+0xbb/0x14a afs_readpages+0x317/0x40d __do_page_cache_readahead+0x203/0x2ba ? ondemand_readahead+0x3a7/0x3c1 ondemand_readahead+0x3a7/0x3c1 generic_file_buffered_read+0x18b/0x62f __vfs_read+0xdb/0xfe vfs_read+0xb2/0x137 ksys_read+0x50/0x8c do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Note the weird value in RDI which is a result of trying to kmap() a NULL page pointer. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 6月, 2018 8 次提交
-
-
由 Andreas Gruenbacher 提交于
Clean up gfs2_iomap_alloc and gfs2_iomap_get. Document how gfs2_iomap_alloc works: it now needs to be called separately after gfs2_iomap_get where necessary; this will be used later by iomap write. Move gfs2_iomap_ops into bmap.c. Introduce a new gfs2_iomap_get_alloc helper and use it in fallocate_chunk: gfs2_iomap_begin will become unsuitable for fallocate with proper iomap write support. In gfs2_block_map and fallocate_chunk, zero-initialize struct iomap. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
In journaled data mode, we need to add each buffer head to the current transaction. In ordered write mode, we only need to add the inode to the ordered inode list. So far, both cases are handled in gfs2_trans_add_data. This makes the code look misleading and is inefficient for small block sizes as well. Handle both cases separately instead. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
First, change the sanity check in gfs2_stuffed_write_end to check for the actual write size instead of the requested write size. Second, use the existing teardown code in gfs2_write_end instead of duplicating it in gfs2_stuffed_write_end. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
Reimplement function hole_size based on a generic function for walking the metadata tree and rename hole_size to gfs2_hole_size. While previously, multiple invocations of hole_size were sometimes needed to walk across the entire hole, the new implementation always returns the entire hole at once (provided that the caller is interested in the total size). Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Bob Peterson 提交于
Function gfs2_free_extlen calculates the length of an extent of free blocks that may be reserved. The end pointer was calculated as end = start + bh->b_size but b_size is incorrect because the bitmap usually stops prior to the end of the buffer data on the last bitmap. What this means is that when you do a write, you can reserve a chunk of blocks that runs off the end of the last bitmap. For example, I've got a file system where there is only one bitmap for each rgrp, so ri_length==1. I saw cases in which iozone tried to do a big write, grabbed a large block reservation, chose rgrp 5464152, which has ri_data0 5464153 and ri_data 8188. So 5464153 + 8188 = 5472341 which is the end of the rgrp. When it grabbed a reservation it got back: 5470936, length 7229. But 5470936 + 7229 = 5478165. So the reservation starts inside the rgrp but runs 5824 blocks past the end of the bitmap. This patch fixes the calculation so it won't exceed the last bitmap. It also adds a BUG_ON to guard against overflows in the future. Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
Before this patch function gfs2_write_begin, upon discovering an error, called gfs2_trim_blocks while the rgrp glock was still held. That's because gfs2_inplace_release is not called until later. This patch reorganizes the logic a bit so gfs2_inplace_release is called to release the lock prior to the call to gfs2_trim_blocks, thus preventing the glock recursion. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Al Viro 提交于
This reverts commit cab64df1. Having vfs_open() in some cases drop the reference to struct file combined with error = vfs_open(path, f, cred); if (error) { put_filp(f); return ERR_PTR(error); } return f; is flat-out wrong. It used to be error = vfs_open(path, f, cred); if (!error) { /* from now on we need fput() to dispose of f */ error = open_check_o_direct(f); if (error) { fput(f); f = ERR_PTR(error); } } else { put_filp(f); f = ERR_PTR(error); } and sure, having that open_check_o_direct() boilerplate gotten rid of is nice, but not that way... Worse, another call chain (via finish_open()) is FUBAR now wrt FILE_OPENED handling - in that case we get error returned, with file already hit by fput() *AND* FILE_OPENED not set. Guess what happens in path_openat(), when it hits if (!(opened & FILE_OPENED)) { BUG_ON(!error); put_filp(file); } The root cause of all that crap is that the callers of do_dentry_open() have no way to tell which way did it fail; while that could be fixed up (by passing something like int *opened to do_dentry_open() and have it marked if we'd called ->open()), it's probably much too late in the cycle to do so right now. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 6月, 2018 4 次提交
-
-
由 Long Li 提交于
Add a function to allocate wdata without allocating pages for data transfer. This gives the caller an option to pass a number of pages that point to the data buffer to write to. wdata is reponsible for free those pages after it's done. Signed-off-by: NLong Li <longli@microsoft.com> Signed-off-by: NSteve French <smfrench@gmail.com>
-
由 Long Li 提交于
With offset defined in rdata, transport functions need to look at this offset when reading data into the correct places in pages. Signed-off-by: NLong Li <longli@microsoft.com> Signed-off-by: NSteve French <smfrench@gmail.com>
-
由 Long Li 提交于
Add a function to allocate rdata without allocating pages for data transfer. This gives the caller an option to pass a number of pages that point to the data buffer. rdata is still reponsible for free those pages after it's done. Signed-off-by: NLong Li <longli@microsoft.com> Signed-off-by: NSteve French <smfrench@gmail.com>
-
由 Ronnie Sahlberg 提交于
Signed-off-by: NRonnie Sahlberg <lsahlber@redhat.com> Signed-off-by: NSteve French <smfrench@gmail.com>
-
- 02 6月, 2018 8 次提交
-
-
由 Christoph Hellwig 提交于
This way the implementation doesn't depend on buffer_head internals. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
We only call into this function through the iomap iterators, so we already know the buffer is unwritten. In addition to that we always require the uptodate flag that is ORed with the result anyway. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
This function is only used by the iomap code, depends on being called from it, and will soon stop poking into buffer head internals. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
Switch to the iomap based bmap implementation to get rid of one of the last users of xfs_get_blocks. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
This adds a simple iomap-based implementation of the legacy ->bmap interface. Note that we can't easily add checks for rt or reflink files, so these will have to remain in the callers. This interface just needs to die.. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
Factor the repeated calculation of the on-disk sector for a given logical block into a littler helper. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
We don't need any merging logic, and this also replaces a BUG_ON with a WARN_ON_ONCE inside __bio_add_page for the impossible overflow condition. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
Just define a range of fs specific flags and use that in gfs2 instead of exposing this internal flag globally. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-