- 06 2月, 2015 1 次提交
-
-
由 Michal Hocko 提交于
It has been reported that 965GM might trigger VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage) in mem_cgroup_migrate when shmem wants to replace a swap cache page because of shmem_should_replace_page (the page is allocated from an inappropriate zone). shmem_replace_page expects that the oldpage is not on LRU list and calls mem_cgroup_migrate without lrucare. This is obviously incorrect because swapcache pages might be on the LRU list (e.g. swapin readahead page). Fix this by enabling lrucare for the migration in shmem_replace_page. Also clarify that lrucare should be used even if one of the pages might be on LRU list. The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even without that the migration code might leave the old page on an inappropriate memcg' LRU which is not that critical because the page would get removed with its last reference but it is still confusing. Fixes: 0a31bc97 ("mm: memcontrol: rewrite uncharge API") Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reported-by: NChris Wilson <chris@chris-wilson.co.uk> Reported-by: NDave Airlie <airlied@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.17+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 12月, 2014 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 24 10月, 2014 1 次提交
-
-
由 Miklos Szeredi 提交于
Allocate a dentry, initialize it with a whiteout and hash it in the place of the old dentry. Later the old dentry will be moved away and the whiteout will remain. i_mutex protects agains concurrent readdir. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Cc: Hugh Dickins <hughd@google.com>
-
- 10 10月, 2014 1 次提交
-
-
由 Andrew Morton 提交于
This is designed to avoid a few ifdefs in .c files but it's obnoxious because it can cause unsuspecting "migrate_page" symbols to get turned into "NULL". Just nuke it and use the ifdefs. Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 9月, 2014 1 次提交
-
-
由 Miklos Szeredi 提交于
If overwriting an empty directory with rename, then need to drop the extra nlink. Test prog: #include <stdio.h> #include <fcntl.h> #include <err.h> #include <sys/stat.h> int main(void) { const char *test_dir1 = "test-dir1"; const char *test_dir2 = "test-dir2"; int res; int fd; struct stat statbuf; res = mkdir(test_dir1, 0777); if (res == -1) err(1, "mkdir(\"%s\")", test_dir1); res = mkdir(test_dir2, 0777); if (res == -1) err(1, "mkdir(\"%s\")", test_dir2); fd = open(test_dir2, O_RDONLY); if (fd == -1) err(1, "open(\"%s\")", test_dir2); res = rename(test_dir1, test_dir2); if (res == -1) err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2); res = fstat(fd, &statbuf); if (res == -1) err(1, "fstat(%i)", fd); if (statbuf.st_nlink != 0) { fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink); return 1; } return 0; } Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 08 9月, 2014 1 次提交
-
-
由 Tejun Heo 提交于
Percpu allocator now supports allocation mask. Add @gfp to percpu_counter_init() so that !GFP_KERNEL allocation masks can be used with percpu_counters too. We could have left percpu_counter_init() alone and added percpu_counter_init_gfp(); however, the number of users isn't that high and introducing _gfp variants to all percpu data structures would be quite ugly, so let's just do the conversion. This is the one with the most users. Other percpu data structures are a lot easier to convert. This patch doesn't make any functional difference. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NJan Kara <jack@suse.cz> Acked-by: N"David S. Miller" <davem@davemloft.net> Cc: x86@kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org>
-
- 09 8月, 2014 5 次提交
-
-
由 David Herrmann 提交于
If we set SEAL_WRITE on a file, we must make sure there cannot be any ongoing write-operations on the file. For write() calls, we simply lock the inode mutex, for mmap() we simply verify there're no writable mappings. However, there might be pages pinned by AIO, Direct-IO and similar operations via GUP. We must make sure those do not write to the memfd file after we set SEAL_WRITE. As there is no way to notify GUP users to drop pages or to wait for them to be done, we implement the wait ourself: When setting SEAL_WRITE, we check all pages for their ref-count. If it's bigger than 1, we know there's some user of the page. We then mark the page and wait for up to 150ms for those ref-counts to be dropped. If the ref-counts are not dropped in time, we refuse the seal operation. Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Herrmann 提交于
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It can support sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it. memfd_create() returns the raw shmem file, so calls like ftruncate() can be used to modify the underlying inode. Also calls like fstat() will return proper information and mark the file as regular file. If you want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not supported (like on all other regular files). Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to a filesystem size limit. It is still properly accounted to memcg limits, though, and to the same overcommit or no-overcommit accounting as all user memory. Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Herrmann 提交于
If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the file. You cannot remove seals again. The fcntl() handler is currently specific to shmem and disabled on all files. A file needs to explicitly support sealing for this interface to work. A separate syscall is added in a follow-up, which creates files that support sealing. There is no intention to support this on other file-systems. Semantics are unclear for non-volatile files and we lack any use-case right now. Therefore, the implementation is specific to shmem. Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile. Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary: - Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging. - Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged. - On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache(). But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore. For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration. mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward. Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim. Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically. [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable] [vdavydov@parallels.com: fix flags definition] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Tested-by: NJet Chen <jet.chen@intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Tested-by: NFelipe Balbi <balbi@ti.com> Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 4): The memcg charge API charges pages before they are rmapped - i.e. have an actual "type" - and so every callsite needs its own set of charge and uncharge functions to know what type is being operated on. Worse, uncharge has to happen from a context that is still type-specific, rather than at the end of the page's lifetime with exclusive access, and so requires a lot of synchronization. Rewrite the charge API to provide a generic set of try_charge(), commit_charge() and cancel_charge() transaction operations, much like what's currently done for swap-in: mem_cgroup_try_charge() attempts to reserve a charge, reclaiming pages from the memcg if necessary. mem_cgroup_commit_charge() commits the page to the charge once it has a valid page->mapping and PageAnon() reliably tells the type. mem_cgroup_cancel_charge() aborts the transaction. This reduces the charge API and enables subsequent patches to drastically simplify uncharging. As pages need to be committed after rmap is established but before they are added to the LRU, page_add_new_anon_rmap() must stop doing LRU additions again. Revive lru_cache_add_active_or_unevictable(). [hughd@google.com: fix shmem_unuse] [hughd@google.com: Add comments on the private use of -EAGAIN] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 8月, 2014 2 次提交
-
-
由 Miklos Szeredi 提交于
This is really simple in tmpfs since the VFS already takes care of shuffling the dentries. Just adjust nlink on parent directories and touch c & mtimes. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Miklos Szeredi 提交于
Implement ->rename2 instead of ->rename. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 07 8月, 2014 4 次提交
-
-
由 Wang Sheng-Hui 提交于
The gfp arg is not used in shmem_add_to_page_cache. Remove this unused arg. Signed-off-by: NWang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Do we really need an exported alias for __SetPageReferenced()? Its callers better know what they're doing, in which case the page would not be already marked referenced. Kill init_page_accessed(), just __SetPageReferenced() inline. Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
A shared anonymous mapping created without MAP_NORESERVE holds memory reservation for whole range of shmem segment. Usually there is no way to change its size, but /proc/<pid>/map_files/... (available if CONFIG_CHECKPOINT_RESTORE=y) allows that. This patch adjusts the memory reservation in shmem_setattr(). Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
If __shmem_file_setup() fails on struct file allocation it uncharges memory commitment twice: first by shmem_unacct_size() and second time implicitly in shmem_evict_inode() when it kills the newly created inode. This patch removes shmem_unacct_size() from error path if the inode was already there. Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 7月, 2014 2 次提交
-
-
由 Hugh Dickins 提交于
shmem_fault() is the actual culprit in trinity's hole-punch starvation, and the most significant cause of such problems: since a page faulted is one that then appears page_mapped(), needing unmap_mapping_range() and i_mmap_mutex to be unmapped again. But it is not the only way in which a page can be brought into a hole in the radix_tree while that hole is being punched; and Vlastimil's testing implies that if enough other processors are busy filling in the hole, then shmem_undo_range() can be kept from completing indefinitely. shmem_file_splice_read() is the main other user of SGP_CACHE, which can instantiate shmem pagecache pages in the read-only case (without holding i_mutex, so perhaps concurrently with a hole-punch). Probably it's silly not to use SGP_READ already (using the ZERO_PAGE for holes): which ought to be safe, but might bring surprises - not a change to be rushed. shmem_read_mapping_page_gfp() is an internal interface used by drivers/gpu/drm GEM (and next by uprobes): it should be okay. And shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when called internally by the kernel (perhaps for a stacking filesystem, which might rely on holes to be reserved): it's unclear whether it could be provoked to keep hole-punch busy or not. We could apply the same umbrella as now used in shmem_fault() to shmem_file_splice_read() and the others; but it looks ugly, and use over a range raises questions - should it actually be per page? can these get starved themselves? The origin of this part of the problem is my v3.1 commit d0823576 ("mm: pincer in truncate_inode_pages_range"), once it was duplicated into shmem.c. It seemed like a nice idea at the time, to ensure (barring RCU lookup fuzziness) that there's an instant when the entire hole is empty; but the indefinitely repeated scans to ensure that make it vulnerable. Revert that "enhancement" to hole-punch from shmem_undo_range(), but retain the unproblematic rescanning when it's truncating; add a couple of comments there. Remove the "indices[0] >= end" test: that is now handled satisfactorily by the inner loop, and mem_cgroup_uncharge_start()/end() are too light to be worth avoiding here. But if we do not always loop indefinitely, we do need to handle the case of swap swizzled back to page before shmem_free_swap() gets it: add a retry for that case, as suggested by Konstantin Khlebnikov; and for the case of page swizzled back to swap, as suggested by Johannes Weiner. Signed-off-by: NHugh Dickins <hughd@google.com> Reported-by: NSasha Levin <sasha.levin@oracle.com> Suggested-by: NVlastimil Babka <vbabka@suse.cz> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lukas Czerner <lczerner@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> [3.1+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Commit f00cdc6d ("shmem: fix faulting into a hole while it's punched") was buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in the fault path is a no-no (write syscall may already hold i_mutex while faulting user buffer). We tried a completely different approach (see following patch) but that proved inadequate: good enough for a rational workload, but not good enough against trinity - which forks off so many mappings of the object that contention on i_mmap_mutex while hole-puncher holds i_mutex builds into serious starvation when concurrent faults force the puncher to fall back to single-page unmap_mapping_range() searches of the i_mmap tree. So return to the original umbrella approach, but keep away from i_mutex this time. We really don't want to bloat every shmem inode with a new mutex or completion, just to protect this unlikely case from trinity. So extend the original with wait_queue_head on stack at the hole-punch end, and wait_queue item on the stack at the fault end. This involves further use of i_lock to guard against the races: lockdep has been happy so far, and I see fs/inode.c:unlock_new_inode() holds i_lock around wake_up_bit(), which is comparable to what we do here. i_lock is more convenient, but we could switch to shmem's info->lock. This issue has been tagged with CVE-2014-4171, which will require commit f00cdc6d and this and the following patch to be backported: we suggest to 3.1+, though in fact the trinity forkbomb effect might go back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might not, since much has changed, with i_mmap_mutex a spinlock before 3.0. Anyone running trinity on 3.0 and earlier? I don't think we need care. Signed-off-by: NHugh Dickins <hughd@google.com> Reported-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lukas Czerner <lczerner@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> [3.1+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 7月, 2014 1 次提交
-
-
由 Hugh Dickins 提交于
Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU) in isolate_lru_pages() at mm/vmscan.c:1281! Commit 2457aec6 ("mm: non-atomically mark page accessed during page cache allocation where possible") looks like interrupted work-in-progress. mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's - shmem_write_begin() is clearly wrong to use it after shmem_getpage(), when the page is always visible in radix_tree, and often already on LRU. Revert change to shmem_write_begin(), and use init_page_accessed() or mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp(). SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed() before; but since many other filesystems use [__]page_symlink(), which did and does mark the page accessed, consider this as rectifying an oversight. Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 6月, 2014 2 次提交
-
-
由 Hugh Dickins 提交于
Trinity finds that mmap access to a hole while it's punched from shmem can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE) from completing, until the reader chooses to stop; with the puncher's hold on i_mutex locking out all other writers until it can complete. It appears that the tmpfs fault path is too light in comparison with its hole-punching path, lacking an i_data_sem to obstruct it; but we don't want to slow down the common case. Extend shmem_fallocate()'s existing range notification mechanism, so shmem_fault() can refrain from faulting pages into the hole while it's punched, waiting instead on i_mutex (when safe to sleep; or repeatedly faulting when not). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NHugh Dickins <hughd@google.com> Reported-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE support being added to fallocate(); but didn't realize until now that I had been too stupid to future-proof shmem_fallocate() against new additions. -EOPNOTSUPP instead of going on to ordinary fallocation. Signed-off-by: NHugh Dickins <hughd@google.com> Reviewed-by: NLukas Czerner <lczerner@redhat.com> Cc: <stable@vger.kernel.org> [3.15] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 6月, 2014 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 05 6月, 2014 2 次提交
-
-
由 Mel Gorman 提交于
aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: NPrabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
shmem_getpage_gfp uses an atomic operation to set the SwapBacked field before it's even added to the LRU or visible. This is unnecessary as what could it possible race against? Use an unlocked variant. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NRik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 5月, 2014 4 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already checked - generic_segment_checks() done after that is just an odd way to spell iov_length(). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 4月, 2014 1 次提交
-
-
由 Geert Uytterhoeven 提交于
Some versions of gcc even warn about it: mm/shmem.c: In function ‘shmem_file_aio_read’: mm/shmem.c:1414: warning: ‘error’ may be used uninitialized in this function If the loop is aborted during the first iteration by one of the two first break statements, error will be uninitialized. Introduced by commit 6e58e79d ("introduce copy_page_to_iter, kill loop over iovec in generic_file_aio_read()"). Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org> Acked-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 4月, 2014 1 次提交
-
-
由 Al Viro 提交于
that commit has fixed only the parts of that mess in fs/splice.c itself; there had been more in several other ->splice_read() instances... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 08 4月, 2014 2 次提交
-
-
由 Michal Hocko 提交于
mem_cgroup_newpage_charge is used only for charging anonymous memory so it is better to rename it to mem_cgroup_charge_anon. mem_cgroup_cache_charge is used for file backed memory so rename it to mem_cgroup_charge_file. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ning Qu 提交于
In shmem/tmpfs, we also use the generic filemap_map_pages, seems the additional checking is not worth a separate version of map_pages for it. Signed-off-by: NNing Qu <quning@google.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 4月, 2014 2 次提交
-
-
由 Johannes Weiner 提交于
shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Page cache radix tree slots are usually stabilized by the page lock, but shmem's swap cookies have no such thing. Because the overall truncation loop is lockless, the swap entry is currently confirmed by a tree lookup and then deleted by another tree lookup under the same tree lock region. Use radix_tree_delete_item() instead, which does the verification and deletion with only one lookup. This also allows removing the delete-only special case from shmem_radix_tree_replace(). Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 4月, 2014 2 次提交
-
-
由 Al Viro 提交于
generic_file_aio_read() was looping over the target iovec, with loop over (source) pages nested inside that. Just set an iov_iter up and pass *that* to do_generic_file_aio_read(). With copy_page_to_iter() doing all work of mapping and copying a page to iovec and advancing iov_iter. Switch shmem_file_aio_read() to the same and kill file_read_actor(), while we are at it. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 26 1月, 2014 1 次提交
-
-
由 Christoph Hellwig 提交于
And instead convert tmpfs to use the new generic ACL code, with two stub methods provided for in-memory filesystems. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 24 1月, 2014 1 次提交
-
-
由 Sasha Levin 提交于
Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 12月, 2013 1 次提交
-
-
由 Eric Paris 提交于
We have a problem where the big_key key storage implementation uses a shmem backed inode to hold the key contents. Because of this detail of implementation LSM checks are being done between processes trying to read the keys and the tmpfs backed inode. The LSM checks are already being handled on the key interface level and should not be enforced at the inode level (since the inode is an implementation detail, not a part of the security model) This patch implements a new function shmem_kernel_file_setup() which returns the equivalent to shmem_file_setup() only the underlying inode has S_PRIVATE set. This means that all LSM checks for the inode in question are skipped. It should only be used for kernel internal operations where the inode is not exposed to userspace without proper LSM checking. It is possible that some other users of shmem_file_setup() should use the new interface, but this has not been explored. Reproducing this bug is a little bit difficult. The steps I used on Fedora are: (1) Turn off selinux enforcing: setenforce 0 (2) Create a huge key k=`dd if=/dev/zero bs=8192 count=1 | keyctl padd big_key test-key @s` (3) Access the key in another context: runcon system_u:system_r:httpd_t:s0-s0:c0.c1023 keyctl print $k >/dev/null (4) Examine the audit logs: ausearch -m AVC -i --subject httpd_t | audit2allow If the last command's output includes a line that looks like: allow httpd_t user_tmpfs_t:file { open read }; There was an inode check between httpd and the tmpfs filesystem. With this patch no such denial will be seen. (NOTE! you should clear your audit log if you have tested for this previously) (Please return you box to enforcing) Signed-off-by: NEric Paris <eparis@redhat.com> Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: Hugh Dickins <hughd@google.com> cc: linux-mm@kvack.org
-