- 07 7月, 2017 9 次提交
-
-
由 Roman Gushchin 提交于
Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed using the memory.stats interface of cgroup v2. The meaning of each value is the same as for global counters, available using /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Punit Agrawal 提交于
A poisoned or migrated hugepage is stored as a swap entry in the page tables. On architectures that support hugepages consisting of contiguous page table entries (such as on arm64) this leads to ambiguity in determining the page table entry to return in huge_pte_offset() when a poisoned entry is encountered. Let's remove the ambiguity by adding a size parameter to convey additional information about the requested address. Also fixup the definition/usage of huge_pte_offset() throughout the tree. Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.comSigned-off-by: NPunit Agrawal <punit.agrawal@arm.com> Acked-by: NSteve Capper <steve.capper@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE) Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS) Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Tatashin 提交于
Update dcache, inode, pid, mountpoint, and mount hash tables to use HASH_ZERO, and remove initialization after allocations. In case of places where HASH_EARLY was used such as in __pv_init_lock_hash the zeroed hash table was already assumed, because memblock zeroes the memory. CPU: SPARC M6, Memory: 7T Before fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 11.798s After fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 3.198s CPU: Intel Xeon E5-2630, Memory: 2.2T: Before fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.245s After fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.244s Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: NBabu Moger <babu.moger@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Rapoport 提交于
Calculation of start end end in __wake_userfault function are not used and can be removed. Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
There is no real reason to duplicate kvmalloc* helpers so drop alloc_fdmem and replace it with the appropriate library function. Link: http://lkml.kernel.org/r/20170531155145.17111-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arvind Yadav 提交于
attribute_groups are not supposed to change at runtime. All functions working with attribute_groups provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. File size before: text data bss dec hex filename 4402 1088 38 5528 1598 fs/ocfs2/stackglue.o File size After adding 'const': text data bss dec hex filename 4442 1024 38 5504 1580 fs/ocfs2/stackglue.o Link: http://lkml.kernel.org/r/cab4e59b4918db3ed2ec77073a4cb310c4429ef5.1498808026.git.arvind.yadav.cs@gmail.comSigned-off-by: NArvind Yadav <arvind.yadav.cs@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 piaojun 提交于
'sd->dbg_sock' is malloced in sc_common_open(), but not freed at the end of sc_fop_release(). Link: http://lkml.kernel.org/r/594FB0A4.2050105@huawei.comSigned-off-by: NJun Piao <piaojun@huawei.com> Reviewed-by: NJoseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fabian Frederick 提交于
Filesystems generally use SUPER_MAGIC values from magic.h instead of a local definition. Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.beSigned-off-by: NFabian Frederick <fabf@skynet.be> Reviewed-by: NMark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Gang He 提交于
Fix a static code checker warning: fs/ocfs2/inode.c:179 ocfs2_iget() warn: passing zero to 'ERR_PTR' Fixes: d56a8f32 ("ocfs2: check/fix inode block for online file check") Link: http://lkml.kernel.org/r/1495516634-1952-1-git-send-email-ghe@suse.comSigned-off-by: NGang He <ghe@suse.com> Reviewed-by: NJoseph Qi <jiangqi903@gmail.com> Reviewed-by: NEric Ren <zren@suse.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 7月, 2017 5 次提交
-
-
由 Arvind Yadav 提交于
attribute_groups are not supposed to change at runtime. All functions working with attribute_groups provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. File size before: text data bss dec hex filename 5259 1344 8 6611 19d3 fs/gfs2/sys.o File size After adding 'const': text data bss dec hex filename 5371 1216 8 6595 19c3 fs/gfs2/sys.o Signed-off-by: NArvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
On failure, keep the inode glock across the final iput of the new inode so that gfs2_evict_inode doesn't have to re-acquire the glock. That way, gfs2_evict_inode won't need to revalidate the block type. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
This patch adds a standardized queueing mechanism for glock work with spin_lock protection to prevent races. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
Put all remaining accesses to gl->gl_object under the gl->gl_lockref.lock spinlock to prevent races. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
由 Andreas Gruenbacher 提交于
So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock work queue to make sure that inode glops which dereference gl->gl_object have finished running before the inode is destroyed. However, flushing the work queue may do more work than needed, and in particular, it may call into DLM, which we want to avoid here. Use a bit lock (GIF_GLOP_PENDING) to synchronize between the inode glops and gfs2_evict_inode instead to get rid of the flushing. In addition, flush the work queues of existing glocks before reusing them for new inodes to get those glocks into a known state: the glock state engine currently doesn't handle glock re-appropriation correctly. (We may be able to fix the glock state engine instead later.) Based on a patch by Steven Whitehouse <swhiteho@redhat.com>. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
- 30 6月, 2017 26 次提交
-
-
由 Deepa Dinamani 提交于
Usage of these apis and their compat versions makes the syscalls: timerfd_settime and timerfd_gettime and their compat implementations simpler. This patch also serves as a preparatory patch for changing syscalls to use new time_t data types to support the y2038 effort by isolating the processing of user pointers through these apis. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Simpler done in the only caller. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Instead of messing with the address limit to use vfs_read/vfs_writev. Note that this requires that exported file implement ->read_iter and ->write_iter. All currently exportable file systems do this. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
De-dupliate some code and allow for passing the flags argument to vfs_iter_write. Additionally it now properly updates timestamps. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
De-dupliate some code and allow for passing the flags argument to vfs_iter_read. Additional it properly updates atime now. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
The checks for the permissions and can read / write flags are common for the callers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Split it into one helper each for reads vs writes. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
opencode it in both callers to simplify the call stack a bit. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
opencode it in both callers to simplify the call stack a bit. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Qu Wenruo 提交于
Commit 4751832d ("btrfs: fiemap: Cache and merge fiemap extent before submit it to user") introduced a warning to catch unemitted cached fiemap extent. However such warning doesn't take the following case into consideration: 0 4K 8K |<---- fiemap range --->| |<----------- On-disk extent ------------------>| In this case, the whole 0~8K is cached, and since it's larger than fiemap range, it break the fiemap extent emit loop. This leaves the fiemap extent cached but not emitted, and caught by the final fiemap extent sanity check, causing kernel warning. This patch removes the kernel warning and renames the sanity check to emit_last_fiemap_cache() since it's possible and valid to have cached fiemap extent. Reported-by: NDavid Sterba <dsterba@suse.cz> Reported-by: NAdam Borowski <kilobyte@angband.pl> Fixes: 4751832d ("btrfs: fiemap: Cache and merge fiemap extent ...") Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jan Kara 提交于
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: 07393101 CC: stable@vger.kernel.org CC: linux-btrfs@vger.kernel.org CC: David Sterba <dsterba@suse.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Chris Mason 提交于
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with v4.12-rc6. This was because commit 70e7af24 made it possible for calc_reclaim_items_nr() to return a negative number. It's not really a bug in that commit, it just didn't go far enough down the stack to find all the possible 64->32 bit overflows. This switches calc_reclaim_items_nr() to return a u64 and changes everyone that uses the results of that math to u64 as well. Reported-by: NDave Jones <davej@codemonkey.org.uk> Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow") Signed-off-by: NChris Mason <clm@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a helper but wrongly sets up the target device. Incidentally there's a local variable with the same name as a parameter in the previous function, so this got caught during runtime as crash in test btrfs/027. Reported-by: NChris Mason <clm@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled [BUG] Under the following case, we can underflow qgroup reserved space. Task A | Task B --------------------------------------------------------------- Quota disabled | Buffered write | |- btrfs_check_data_free_space() | | *NO* qgroup space is reserved | | since quota is *DISABLED* | |- All pages are copied to page | cache | | Enable quota | Quota scan finished | | Sync_fs | |- run_delalloc_range | |- Write pages | |- btrfs_finish_ordered_io | |- insert_reserved_file_extent | |- btrfs_qgroup_release_data() | Since no qgroup space is reserved in Task A, we underflow qgroup reserved space This can be detected by fstest btrfs/104. [CAUSE] In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes size of qgroup reserved_space in all cases. And btrfs_qgroup_release_data() will check if quotas are enabled. However in the above case, the buffered write happens before quota is enabled, so we don't have the reserved space for that range. [FIX] In insert_reserved_file_extent(), we tell qgroup to release the acctual byte number it released. In the above case, since we don't have the reserved space, we tell qgroups to release 0 byte, so the problem can be fixed. And thanks to the @reserved parameter introduced by the qgroup rework, and previous patch to return released bytes, the fix can be as small as 10 lines. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> [ changelog updates ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
btrfs_qgroup_release/free_data() only returns 0 or a negative error number (ENOMEM is the only possible error). This is normally good enough, but sometimes we need the exact byte count it freed/released. Change it to return actually released/freed bytenr number instead of 0 for success. And slightly modify related extent_changeset structure, since in btrfs one no-hole data extent won't be larger than 128M, so "unsigned int" is large enough for the use case. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Quite a lot of qgroup corruption happens due to wrong time of calling btrfs_qgroup_prepare_account_extents(). Since the safest time is to call it just before btrfs_qgroup_account_extents(), there is no need to separate these 2 functions. Merging them will make code cleaner and less bug prone. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> [ changelog and comment adjustments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Modify btrfs_qgroup_account_extent() to exit quicker for non-fs extents. The quick exit condition is: 1) The extent belongs to a non-fs tree Only fs-tree extents can affect qgroup numbers and is the only case where extent can be shared between different trees. Although strictly speaking extent in data-reloc or tree-reloc tree can be shared, data/tree-reloc root won't appear in the result of btrfs_find_all_roots(), so we can ignore such case. So we can check the first root in old_roots/new_roots ulist. - if we find the 1st root is a not a fs/subvol root, then we can skip the extent - if we find the 1st root is a fs/subvol root, then we must continue calculation OR 2) both 'nr_old_roots' and 'nr_new_roots' are 0 This means either such extent got allocated then freed in current transaction or it's a new reloc tree extent, whose nr_new_roots is 0. Either way it won't affect qgroup accounting and can be skipped safely. Such quick exit can make trace output more quite and less confusing: (example with fs uuid and time stamp removed) Before: ------ add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0 btrfs_qgroup_account_extent: bytenr=29556736 num_bytes=16384 nr_old_roots=0 nr_new_roots=1 ------ Extent tree block will trigger btrfs_qgroup_account_extent() trace point while no qgroup number is changed, as extent tree won't affect qgroup accounting. After: ------ add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0 ------ Now such unrelated extent won't trigger btrfs_qgroup_account_extent() trace point, making the trace less noisy. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> [ changelog and comment adjustments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
The total_bytes_pinned counter is completely broken when accounting delayed refs: - If two drops for the same extent are merged, we will decrement total_bytes_pinned twice but only increment it once. - If an add is merged into a drop or vice versa, we will decrement the total_bytes_pinned counter but never increment it. - If multiple references to an extent are dropped, we will account it multiple times, potentially vastly over-estimating the number of bytes that will be freed by a commit and doing unnecessary work when we're close to ENOSPC. The last issue is relatively minor, but the first two make the total_bytes_pinned counter leak or underflow very often. These accounting issues were introduced in b150a4f1 ("Btrfs: use a percpu to keep track of possibly pinned bytes"), but they were papered over by zeroing out the counter on every commit until d288db5d ("Btrfs: fix race of using total_bytes_pinned"). We need to make sure that an extent is accounted as pinned exactly once if and only if we will drop references to it when when the transaction is committed. Ideally we would only add to total_bytes_pinned when the *last* reference is dropped, but this information isn't readily available for data extents. Again, this over-estimation can lead to extra commits when we're close to ENOSPC, but it's not as bad as before. The fix implemented here is to increment total_bytes_pinned when the total refmod count for an extent goes negative and decrement it if the refmod count goes back to non-negative or after we've run all of the delayed refs for that extent. Signed-off-by: NOmar Sandoval <osandov@fb.com> Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
We need this to decide when to account pinned bytes. Signed-off-by: NOmar Sandoval <osandov@fb.com> Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
Currently, we only increment total_bytes_pinned in btrfs_free_tree_block() when dropping the last reference on the block. However, when the delayed ref is run later, we will decrement total_bytes_pinned regardless of whether it was the last reference or not. This causes the counter to underflow when the reference we dropped was not the last reference. Fix it by incrementing the counter unconditionally, which is what btrfs_free_extent() does. This makes total_bytes_pinned an overestimate when references to shared extents are dropped, but in the worst case this will just make us try to commit the transaction to try to free up space and find we didn't free enough. Signed-off-by: NOmar Sandoval <osandov@fb.com> Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
The extents marked in pin_down_extent() will be unpinned later in unpin_extent_range(), which decrements total_bytes_pinned. pin_down_extent() must increment the counter to avoid underflowing it. Also adjust btrfs_free_tree_block() to avoid accounting for the same extent twice. Signed-off-by: NOmar Sandoval <osandov@fb.com> Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
The value of flags is one of DATA/METADATA/SYSTEM, they must exist at when add_pinned_bytes is called. Signed-off-by: NOmar Sandoval <osandov@fb.com> Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ added changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
There are a few places where we pass in a negative num_bytes, so make it signed for clarity. Also move it up in the file since later patches will need it there. Signed-off-by: NOmar Sandoval <osandov@fb.com> Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The XATTR_ITEM is a type of a directory item so we use the common validator helper. Unlike other dir items, it can have data. The way the name len validation is currently implemented does not reflect that. We'd have to adjust by the data_len when comparing the read and item limits. However, this will not work for multi-item xattr dir items. Example from tree dump of generic/337: item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147 location key (0 UNKNOWN.0 0) type XATTR transid 8 data_len 3 name_len 11 name: user.foobar data 123 location key (0 UNKNOWN.0 0) type XATTR transid 8 data_len 6 name_len 13 name: user.WvG1c1Td data qwerty location key (0 UNKNOWN.0 0) type XATTR transid 8 data_len 5 name_len 19 name: user.J3__T_Km3dVsW_ data hello At the point of btrfs_is_name_len_valid call we don't have access to the data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf, di) would always return 3, although we'd need to get 6 and 5 respectively to get the claculations right. (read_end + name_len + data_len vs item_end) We'd have to also pass data_len externally, which is not point of the name validation. The last check is supposed to test if there's at least one dir item space after the one we're processing. I don't think this is particularly useful, validation of the next item would catch that too. So the check is removed and we don't weaken the validation. Now tests btrfs/048, btrfs/053, generic/273 and generic/337 pass. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-