- 17 Jan, 2020 10 commits
-
-
Authored by Jens Axboe
commit 85f4d4b65fdd67f1d6dc9eeb1d91923cef07eb6a upstream.

We currently only really support sync poll, i.e. poll with 1 IO in flight. This prepares us for supporting async poll.

Note that the returned value isn't necessarily 100% accurate. If poll races with IRQ completion, we assume that the fact that the task is now runnable means we found at least one entry. In reality it could be more than 1, or not even 1. This is fine, the caller will just need to take this into account.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Jens Axboe
commit d1e36282b0bbd5de6a9c4d5275e94ef3b3438f48 upstream.

We use IOCB_HIPRI to poll for IO in the caller instead of scheduling. This information is not available for (or after) IO submission. The driver may make different queue choices based on the type of IO, so make the fact that we will poll for this IO known to the lower layers as well.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by David Howells
commit aa563d7bca6e882ec2bdae24603c8f016401a144 upstream.

In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places.

Convert a bunch of places to use switch-statements to access them rather than chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient: to implement a switch over small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-AND instructions.

Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by David Howells
commit 00e23707442a75b404392cef1405ab4fd498de6b upstream.

Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Christoph Hellwig
commit 42ee3cae0ed38b6c04038bf851ea2496da2135bb upstream.

Error handling of the dma_map_single and dma_map_page APIs is a little problematic at the moment, in that we use different encodings in the returned dma_addr_t to indicate an error. That means we require an additional indirect call to figure out if a dma mapping call returned an error, and a lot of boilerplate code to implement these semantics.

Instead return the maximum addressable value as the error. As long as we don't allow mapping single-byte ranges with single-byte alignment this value can never be a valid return. Additionally, if drivers do not check the return value from the dma_map* routines, this value means they will generally not be pointed to actual memory.

Once the default value is added here we can start removing the various mapping_error methods and just rely on this generic check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
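The sentinel argument in the commit message above can be illustrated with a small model. This is only a sketch, not kernel code: the 64-bit address width, the constant name `DMA_MAPPING_ERROR`, and the helpers `mapping_failed` and `can_be_valid_mapping` are assumptions made for the example.

```python
# Illustrative model only. Assumption: dma_addr_t is 64 bits wide.
ADDR_BITS = 64

# Hypothetical name for the sentinel: all bits set, i.e. the maximum
# addressable value described in the commit message.
DMA_MAPPING_ERROR = (1 << ADDR_BITS) - 1


def mapping_failed(dma_addr: int) -> bool:
    """Generic error check: a single comparison against the sentinel."""
    return dma_addr == DMA_MAPPING_ERROR


def can_be_valid_mapping(start: int, length: int, alignment: int) -> bool:
    """A real mapping must be aligned and must fit inside the address space."""
    return start % alignment == 0 and start + length <= (1 << ADDR_BITS)


# The sentinel could only be a genuine mapping for a one-byte range
# with one-byte alignment, which is the case the commit excludes.
print(can_be_valid_mapping(DMA_MAPPING_ERROR, 1, 1))  # True
print(can_be_valid_mapping(DMA_MAPPING_ERROR, 2, 1))  # False
print(can_be_valid_mapping(DMA_MAPPING_ERROR, 1, 2))  # False
```

Because the sentinel is odd and sits at the very top of the address space, any range longer than one byte overflows and any alignment above one byte fails, which is why a single generic comparison suffices.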
-
Authored by Woods, Brian
commit be3518a16ef270e3b030a6ae96055f83f51bd3dd upstream.

Add the PCI device IDs for family 17h model 30h, since they are needed for accessing various registers via the data fabric/SMN interface.

Signed-off-by: Brian Woods <brian.woods@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Clemens Ladisch <clemens@ladisch.de>
CC: Guenter Roeck <linux@roeck-us.net>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Jean Delvare <jdelvare@suse.com>
CC: Jia Zhang <qianyue.zj@alibaba-inc.com>
CC: <linux-hwmon@vger.kernel.org>
CC: <linux-pci@vger.kernel.org>
CC: Pu Wen <puwen@hygon.cn>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20181106200754.60722-4-brian.woods@amd.com
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Xunlei Pang
We reserve some fields beforehand in core structures that are prone to change, so that we are not hurt when extra fields have to be added for a hotfix. This increases the hotfix success rate, and we can even hot-add features with this enhancement.

After reserving, cache behavior is normally unaffected, since the reserved fields (usually at the tail) are not accessed at all.

The following structures are currently involved:

MM:
    struct zone
    struct pglist_data
    struct mm_struct
    struct vm_area_struct
    struct mem_cgroup
    struct writeback_control

Block:
    struct gendisk
    struct backing_dev_info
    struct bio
    struct queue_limits
    struct request_queue
    struct blkcg
    struct blkcg_policy
    struct blk_mq_hw_ctx
    struct blk_mq_tag_set
    struct blk_mq_queue_data
    struct blk_mq_ops
    struct elevator_mq_ops
    struct inode
    struct dentry
    struct address_space
    struct block_device
    struct hd_struct
    struct bio_set

Network:
    struct sk_buff
    struct sock
    struct net_device_ops
    struct xt_target
    struct dst_entry
    struct dst_ops
    struct fib_rule

Scheduler:
    struct task_struct
    struct cfs_rq
    struct rq
    struct sched_statistics
    struct sched_entity
    struct signal_struct
    struct task_group
    struct cpuacct

cgroup:
    struct cgroup_root
    struct cgroup_subsys_state
    struct cgroup_subsys
    struct css_set

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ caspar: use SPDX-License-Identifier ]
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Joseph Qi
Instead of using the static Kconfig option CONFIG_PSI_CGROUP_V1, we introduce a boot parameter psi_v1 to enable psi cgroup v1 support. It is disabled by default, which means that when passing the psi=1 boot parameter, we only support cgroup v2.

This keeps it consistent with other cgroup v1 features such as cgroup writeback v1 (cgwb_v1).

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Authored by Kairui Song
commit d15d356887e770c5f2dcf963b52c7cb510c9e42d upstream.

Currently perf callchain doesn't work well with the ORC unwinder when sampling from a trace point. We get a useless in-kernel callchain like this:

    perf 6429 [000] 22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
        ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
        7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
        5651468729c1 [unknown] (/usr/bin/perf)
        5651467ee82a main+0x69a (/usr/bin/perf)
        7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
        5541f689495641d7 [unknown] ([unknown])

The root cause is that trace point events don't provide a real snapshot of the hardware registers. Instead, perf tries to get the required caller's registers and composes a fake register snapshot which is supposed to contain enough information to start unwinding. However, without CONFIG_FRAME_POINTER, it fails to get the caller's BP as the frame pointer, so the current frame pointer is returned instead. We get an invalid register combination which confuses the unwinder and ends the stacktrace early.

So in such a case just don't try to dump BP, and let the unwinder start directly when the register is not a real snapshot. Use SP as the skip mark: the unwinder will skip all the frames until it meets the frame of the trace point caller.

Tested with the frame pointer unwinder and the ORC unwinder, this makes perf callchain get the full kernel space stacktrace again, like this:

    perf 6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
        ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
        ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
        ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
        ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
        ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
        ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
        ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
        ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
        7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
        55a22960d9c1 [unknown] (/usr/bin/perf)
        55a22958982a main+0x69a (/usr/bin/perf)
        7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
        5541f689495641d7 [unknown] ([unknown])

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190422162652.15483-1-kasong@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Joseph Qi
Fix the following kernel-doc warning:

    include/linux/jbd2.h:1184: warning: Function parameter or member 'j_checkpoint_task' not described in 'journal_s'

Fixes: 3999cdd9 ("alinux: jbd2: create jbd2-ckpt thread for journal checkpoint")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 15 Jan, 2020 30 commits
-
-
Authored by Yang Shi
The commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe ("mm: thp: make deferred split shrinker memcg aware") makes the deferred split queue per memcg to resolve the memcg premature OOM problem. But all nodes end up sharing the same queue, instead of one queue per node as before the commit. It is not a big deal for memcg limit reclaim, but it may cause global kswapd to shrink THPs from a different node.

And, 0-day testing reported a -19.6% regression of stress-ng's madvise test [1]. I didn't see that much regression on my test box (24 threads, 48GB memory, 2 nodes). With the same test (stress-ng --timeout 1 --metrics-brief --sequential 72 --class vm --exclude spawn,exec), I sometimes saw an average -3% regression (not every time; sometimes I didn't see any regression at all). I ran the same test 10 times and calculated the average, since the test itself may have at most 15% variation according to my testing.

This might be caused by deferred split queue lock contention. With some configurations (i.e. just one root memcg) the lock contention may be worse than before (given 2 nodes, two locks are reduced to one lock).

So, move the deferred split queue to the memcg's nodeinfo to make it NUMA aware again.

With this change stress-ng's madvise test sometimes shows an average 4% improvement, and I didn't see degradation anymore.

[1]: https://lore.kernel.org/lkml/20190930084604.GC17687@shao2-debian/

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Authored by Yang Shi
commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe upstream

Currently the THP deferred split shrinker is not memcg aware, which may cause premature OOM with some configurations. For example the test below would run into premature OOM easily:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM, but there are still a lot of THPs on the deferred split queue. memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per-memcg deferred split queue. The THP should be on either the per-node or the per-memcg deferred split queue if it belongs to a memcg. When the page is migrated to another memcg, it will be migrated to the target memcg's deferred split queue too.

Reuse the second tail page's deferred_list for the per-memcg list, since the same THP can't be on multiple deferred split queues.

[yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Authored by Yang Shi
commit 0a432dcbeb32edcd211a5d8f7847d0da7642a8b4 upstream

Currently the shrinker is only allocated, and can only work, when memcg kmem is enabled. But the THP deferred split shrinker is not a slab shrinker, so it doesn't make much sense for it to depend on memcg kmem. It should be able to reclaim THPs even when memcg kmem is disabled.

Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinkers. When memcg kmem is disabled, only such shrinkers can be called when shrinking memcg slab.

[yang.shi@linux.alibaba.com: add comment]
Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Authored by Yang Shi
commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 upstream

Patch series "Make deferred split shrinker memcg aware", v6.

Currently the THP deferred split shrinker is not memcg aware, which may cause premature OOM with some configurations. For example the test below would run into premature OOM easily:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM, but there are still a lot of THPs on the deferred split queue. memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per-memcg deferred split queue. The THP should be on either the per-node or the per-memcg deferred split queue if it belongs to a memcg. When the page is migrated to another memcg, it will be migrated to the target memcg's deferred split queue too.

Reuse the second tail page's deferred_list for the per-memcg list, since the same THP can't be on multiple deferred split queues.

Make the deferred split shrinker not depend on memcg kmem, since it is not slab. It doesn't make sense to not shrink THPs just because memcg kmem is disabled.

With the above change the test demonstrated above doesn't trigger OOM, even with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in order to reduce code duplication when we convert deferred_split to be memcg aware in the later patches.

Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Authored by Gavin Shan
This enables scanning pages at a fixed interval to determine their access frequency (hot/cold). The result is exported to user land per memory cgroup through "memory.idle_page_stats". The design is highlighted below:

* A kernel thread is spawned when this feature is enabled by writing a non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds". The thread sequentially scans the nodes and their pages that have been chained up in the LRU list.

* For each page, its corresponding age information is stored in the page flags or in an array in the node. The age represents the number of scanning intervals in which the page wasn't accessed. Also, the page flag (PG_idle) is leveraged. The page's age is increased by one if the idle flag isn't cleared in two consecutive scans. Otherwise, the page's age is cleared out. Also, the page's age information is cleared when it's freed so that stale age information won't be fetched when it's allocated.

* Initially, the flag is set, while the access bit in its PTE is cleared out by the thread. In the next scanning period, its PTE access bit is synchronized with the page flag: clear the flag if the access bit is set. The flag is kept otherwise. For unmapped pages, the flag is cleared when the page is accessed.

* Eventually, the page's aging information is updated to the unstable bucket of its corresponding memory cgroup, taken as statistics. The unstable bucket (statistics) is copied to the stable bucket when all pages in all nodes have been scanned once. The stable bucket (statistics) is exported to user land through "memory.idle_page_stats".

TESTING
=======

* cgroup1, unmapped pagecache

    # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
    #
    # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
    # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
    # mkdir -p /cgroup/memory
    # mount -tcgroup -o memory /cgroup/memory
    # echo 1 > /cgroup/memory/memory.use_hierarchy
    # mkdir -p /cgroup/memory/test
    # echo 1 > /cgroup/memory/test/memory.use_hierarchy
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # dd if=/ext4/test.data of=/dev/null bs=1M count=128
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
    cfei 0 0 0 134217728 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep cfei
    cfei 0 0 0 134217728 0 0 0 0

* cgroup1, mapped pagecache

    # < create same file and memory cgroups as above >
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # < run program to mmap the whole created file and access the area >
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
    cfei 0 134217728 0 0 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep cfei
    cfei 0 134217728 0 0 0 0 0 0

* cgroup1, mapped and locked pagecache

    # < create same file and memory cgroups as above >
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # < run program to mmap the whole created file and mlock the area >
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
    cfui 0 134217728 0 0 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep cfui
    cfui 0 134217728 0 0 0 0 0 0

* cgroup1, anonymous and locked area

    # < create memory cgroups as above >
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # < run program to mmap anonymous area and mlock it >
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
    csui 0 0 134217728 0 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep csui
    csui 0 0 134217728 0 0 0 0 0

* Rerun the above test cases in cgroup2; the results show no exceptions. However, the cgroups are populated in a different way, as below:

    # mkdir -p /cgroup
    # mount -tcgroup2 none /cgroup
    # echo "+memory" > /cgroup/cgroup.subtree_control
    # mkdir -p /cgroup/test

Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
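The per-page aging rule described in the design notes can be modeled in a few lines. This is a simplified sketch of the idea for a mapped page, not the kidled implementation: the `Page` class, its field names, and `scan_page` are hypothetical, and the real age is stored in page flags or a per-node array rather than in an object.

```python
from dataclasses import dataclass


@dataclass
class Page:
    idle: bool = False      # models the PG_idle page flag
    accessed: bool = False  # models the access bit in the page's PTE
    age: int = 0            # scan intervals during which the page was not accessed


def scan_page(page: Page) -> None:
    """One scan pass over a mapped page (hypothetical helper)."""
    if page.accessed:
        # Synchronize the PTE access bit with the flag: an access clears it.
        page.idle = False
    if page.idle:
        # The flag survived since the previous scan, so the page stayed idle.
        page.age += 1
    else:
        # The page was touched (or has never been scanned): reset its age.
        page.age = 0
    # Re-arm for the next interval: set the flag, clear the access bit.
    page.idle = True
    page.accessed = False


page = Page()
for _ in range(3):
    scan_page(page)
print(page.age)  # 2: the first scan only arms the flag

page.accessed = True
scan_page(page)
print(page.age)  # 0: an access resets the age
```

The sketch leaves out the unstable and stable statistics buckets; in the design above, each scan would additionally add the page to the bucket matching its age.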
-
Authored by Yang Shi
Introduce a new interface, wmark_scale_factor, which defines the distance between wmark_high and wmark_low. The unit is in fractions of 10,000. The default value of 50 means the distance between wmark_high and wmark_low is 0.5% of the max limit of the cgroup. The maximum value is 1000, or 10% of the max limit.

The distance between wmark_low and wmark_high has an impact on how hard memcg kswapd reclaims.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
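The arithmetic behind wmark_scale_factor can be checked with a short sketch. It assumes integer division and a limit expressed in bytes; the function name `wmark_distance` is hypothetical and only the upper bound stated in the commit message is enforced.

```python
def wmark_distance(max_limit: int, wmark_scale_factor: int = 50) -> int:
    """Distance between wmark_high and wmark_low, in the unit of max_limit.

    The factor is expressed in fractions of 10,000, with a maximum of 1000
    (10% of the limit) as stated in the commit message.
    """
    if wmark_scale_factor > 1000:
        raise ValueError("wmark_scale_factor must not exceed 1000")
    return max_limit * wmark_scale_factor // 10_000


limit = 4 << 30  # a 4 GiB cgroup limit, chosen only for illustration
print(wmark_distance(limit))        # 21474836 bytes, i.e. 0.5% of the limit
print(wmark_distance(limit, 1000))  # 429496729 bytes, i.e. 10% of the limit
```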
-
Authored by Yang Shi
The global kswapd can mark a memory node as dirty or writeback if the current scan finds that all pages are unqueued dirty or under writeback. kswapd then writes out dirty pages or waits for writeback to finish. The memcg kswapd behaves like the global kswapd, so it should set the dirty or writeback state on the memcg too if the same condition is met.

Since direct reclaim can't write out page cache, the system depends on kswapd to write out dirty pages when a scan finds too many of them, in order to avoid premature OOM. But if page cache is dirtied too fast, writing out pages definitely can't keep up with dirtying. It is the responsibility of dirty page balancing to throttle the dirtying of pages.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Authored by Yang Shi
Currently, when memory usage exceeds the memory cgroup limit, the memory cgroup can only do sync direct reclaim. This may incur unexpected stalls in applications that are sensitive to latency. Introduce a background async page reclaim mechanism, like what kswapd does.

Define the memcg memory usage water marks by introducing the wmark_ratio interface, which ranges from 0 to 100 and represents a percentage of the max limit. wmark_high is calculated as (max * wmark_ratio / 100), and wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical value. If wmark_ratio is 0, the water mark is disabled and both wmark_low and wmark_high are max; this is the default value. If wmark_ratio is set, then when charging a page, if usage is greater than wmark_high, which means the available memory of the memcg is low, a work item is scheduled to do background page reclaim until memory usage is reduced to wmark_low if possible.

Define a dedicated unbound workqueue for scheduling water mark reclaim works.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
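The water mark formulas from the commit message can be worked through as follows. This is a sketch under two assumptions: integer arithmetic, and that the low mark is meant as wmark_high minus (wmark_high >> 8), since a literal C reading of the unparenthesized expression would evaluate to zero. The function names are hypothetical.

```python
from typing import Tuple


def memcg_watermarks(max_limit: int, wmark_ratio: int) -> Tuple[int, int]:
    """Return (wmark_low, wmark_high) for a memcg with the given max limit."""
    if not 0 <= wmark_ratio <= 100:
        raise ValueError("wmark_ratio must be between 0 and 100")
    if wmark_ratio == 0:
        # Water marks disabled: both sit at the limit itself.
        return max_limit, max_limit
    wmark_high = max_limit * wmark_ratio // 100
    # Assumed grouping: subtract 1/256 of the high mark (an empirical value).
    wmark_low = wmark_high - (wmark_high >> 8)
    return wmark_low, wmark_high


def needs_background_reclaim(usage: int, wmark_high: int) -> bool:
    """Background reclaim is scheduled once usage rises above wmark_high."""
    return usage > wmark_high


low, high = memcg_watermarks(1 << 20, 50)
print(low, high)  # 522240 524288
print(needs_background_reclaim(600_000, high))  # True
```

With the default ratio of 0 both marks equal the limit, so usage can never exceed wmark_high and no background reclaim is ever scheduled.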
-
Authored by Jiufei Xue
This doesn't cause any behavior change and will be used by the overlay async IO implementation.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by Xiaoguang Wang
Currently in blk_throtl_bio(), if a bio exceeds its throtl_grp's bps or iops limit, the bio is queued on the throtl_grp's throtl_service_queue. The mm subsystem will then keep submitting more pages even though the underlying device cannot handle these IO requests. This also causes a large number of pages to enter writeback prematurely; if a process later writes to some of these pages, it will wait for a long time.

I have done some tests: one process does buffered writes on a 1GB file, with the process's blkcg max bps limit set to 10MB/s. I observe this:

    #cat /proc/meminfo | grep -i back
    Writeback: 900024 kB
    WritebackTmp: 0 kB

I think this Writeback value is just too big. Many bios have indeed been queued in the throtl_grp's throtl_service_queue. If a process tries to write the page of the last bio in this queue, it will call wait_on_page_writeback(page), which must wait for the previous bios to finish and will take a long time. We have also seen 120s hung task warnings on our servers:

    INFO: task kworker/u128:0:30072 blocked for more than 120 seconds.
    Tainted: G E 4.9.147-013.ali3000_015_test.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kworker/u128:0 D 0 30072 2 0x00000000
    Workqueue: writeback wb_workfn (flush-8:16)
    ffff882ddd066b40 0000000000000000 ffff882e5cad3400 ffff882fbe959e80
    ffff882fa50b1a00 ffffc9003a5a3768 ffffffff8173325d ffffc9003a5a3780
    00ff882e5cad3400 ffff882fbe959e80 ffffffff81360b49 ffff882e5cad3400
    Call Trace:
    [<ffffffff8173325d>] ? __schedule+0x23d/0x6d0
    [<ffffffff81360b49>] ? alloc_request_struct+0x19/0x20
    [<ffffffff81733726>] schedule+0x36/0x80
    [<ffffffff81736c56>] schedule_timeout+0x206/0x4b0
    [<ffffffff81036c69>] ? sched_clock+0x9/0x10
    [<ffffffff81363073>] ? get_request+0x403/0x810
    [<ffffffff8110ca10>] ? ktime_get+0x40/0xb0
    [<ffffffff81732f8a>] io_schedule_timeout+0xda/0x170
    [<ffffffff81733f90>] ? bit_wait+0x60/0x60
    [<ffffffff81733fab>] bit_wait_io+0x1b/0x60
    [<ffffffff81733b28>] __wait_on_bit+0x58/0x90
    [<ffffffff811b0d91>] ? find_get_pages_tag+0x161/0x2e0
    [<ffffffff811aff62>] wait_on_page_bit+0x82/0xa0
    [<ffffffff810d47f0>] ? wake_atomic_t_function+0x60/0x60
    [<ffffffffa02fc181>] mpage_prepare_extent_to_map+0x2d1/0x310 [ext4]
    [<ffffffff8121ff65>] ? kmem_cache_alloc+0x185/0x1a0
    [<ffffffffa0305a2f>] ? ext4_init_io_end+0x1f/0x40 [ext4]
    [<ffffffffa0300294>] ext4_writepages+0x404/0xef0 [ext4]
    [<ffffffff81508c64>] ? scsi_init_io+0x44/0x200
    [<ffffffff81398a0f>] ? fprop_fraction_percpu+0x2f/0x80
    [<ffffffff811c139e>] do_writepages+0x1e/0x30
    [<ffffffff8127c0f5>] __writeback_single_inode+0x45/0x320
    [<ffffffff8127c942>] writeback_sb_inodes+0x272/0x600
    [<ffffffff8127cf6b>] wb_writeback+0x10b/0x300
    [<ffffffff8127d884>] wb_workfn+0xb4/0x380
    [<ffffffff810b85e9>] ? try_to_wake_up+0x59/0x3e0
    [<ffffffff810a5759>] process_one_work+0x189/0x420
    [<ffffffff810a5a3e>] worker_thread+0x4e/0x4b0
    [<ffffffff810a59f0>] ? process_one_work+0x420/0x420
    [<ffffffff810ac026>] kthread+0xe6/0x100
    [<ffffffff810abf40>] ? kthread_park+0x60/0x60
    [<ffffffff81738499>] ret_from_fork+0x39/0x50

To fix this issue, we can simply limit the maximum number of bios queued in throtl_service_queue. Currently we limit it to the throtl_grp's bps_limit or iops limit; if the queue still exceeds that, we just sleep for a while.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Joseph Qi
The io throttle stats code calls blkg_get at the beginning of throttling and blkg_put in the newly introduced bi_tg_end_io. This causes the blkg to be freed if end_io is called twice, as with dm-thin, which saves the original end_io first, then calls its own overriding end_io and afterwards the saved end_io. After that, accessing the blkg is invalid and finally triggers a BUG:

    [ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
    [ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
    [ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
    [ 4417.239232] Oops: 0000 [#1] SMP
    ......
    [ 4417.274070] Call Trace:
    [ 4417.275407] [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
    [ 4417.276760] [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
    [ 4417.278079] [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
    [ 4417.279387] [<ffffffff81095772>] ? insert_work+0x62/0xa0
    [ 4417.280697] [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
    [ 4417.282019] [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
    [ 4417.283326] [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
    [ 4417.284637] [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
    [ 4417.285951] [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
    [ 4417.287240] [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
    [ 4417.288503] [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
    [ 4417.289778] [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
    [ 4417.291062] [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
    [ 4417.292344] [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
    [ 4417.293626] [<ffffffff812c9e61>] submit_bio+0x71/0x150
    [ 4417.294909] [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
    [ 4417.296195] [<ffffffff81215acb>] _submit_bh+0x14b/0x220
    [ 4417.297484] [<ffffffff81215bb0>] submit_bh+0x10/0x20
    [ 4417.298744] [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
    [ 4417.300014] [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
    [ 4417.301268] [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
    [ 4417.302524] [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
    [ 4417.303753] [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
    [ 4417.304950] [<ffffffff8109ffef>] kthread+0xcf/0xe0
    [ 4417.306107] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
    [ 4417.307255] [<ffffffff81647f18>] ret_from_fork+0x58/0x90
    [ 4417.308349] [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
    ......

Now we introduce a new bio flag, BIO_THROTL_STATED, to make sure blkg_get/put only get called once for the same bio.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
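The double-put problem and the flag-based guard can be modeled with a toy reference counter. This is a sketch of the general technique only: the classes `Blkg` and `Bio`, the `throtl_stated` attribute, and the helpers `throtl_stat_start` and `throtl_stat_end_io` are all hypothetical, and whether the real patch clears the flag at completion is an assumption.

```python
class Blkg:
    """Toy stand-in for a reference-counted blkg."""

    def __init__(self) -> None:
        self.refs = 1  # reference held by the owner

    def get(self) -> None:
        self.refs += 1

    def put(self) -> None:
        if self.refs <= 0:
            raise RuntimeError("put on an already freed blkg")
        self.refs -= 1


class Bio:
    def __init__(self, blkg: Blkg) -> None:
        self.blkg = blkg
        self.throtl_stated = False  # models the BIO_THROTL_STATED flag


def throtl_stat_start(bio: Bio) -> None:
    """Take the stats reference at most once per bio."""
    if bio.throtl_stated:
        return
    bio.blkg.get()
    bio.throtl_stated = True


def throtl_stat_end_io(bio: Bio) -> None:
    """Drop the stats reference at most once, even if end_io runs twice."""
    if not bio.throtl_stated:
        return
    bio.throtl_stated = False  # assumption: the flag is cleared here
    bio.blkg.put()


blkg = Blkg()
bio = Bio(blkg)
throtl_stat_start(bio)
throtl_stat_end_io(bio)  # e.g. the overriding end_io of a stacked driver
throtl_stat_end_io(bio)  # the saved original end_io: now a no-op
print(blkg.refs)  # 1: the owner's reference is still intact
```

Without the flag check, the second completion would drop the owner's reference as well, which corresponds to the premature free described in the commit message.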
-
Authored by Joseph Qi
Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to get per-cgroup io delay statistics.

io_service_time represents the time spent from the end of io throttling to io completion, while io_wait_time represents the time spent on the throttle queue.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
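How the two statistics split an IO's lifetime can be shown with three timestamps. This is an illustrative sketch only: the parameter names, the nanosecond unit, and the function `io_delays` are assumptions, and the real counters are accumulated per cgroup rather than computed per call.

```python
from typing import Tuple


def io_delays(enter_throttle_ns: int, leave_throttle_ns: int,
              complete_ns: int) -> Tuple[int, int]:
    """Split one IO's lifetime into (io_wait_time, io_service_time).

    io_wait_time:    time spent on the throttle queue.
    io_service_time: time from leaving the throttle layer to completion.
    """
    if not enter_throttle_ns <= leave_throttle_ns <= complete_ns:
        raise ValueError("timestamps must be monotonically ordered")
    io_wait_time = leave_throttle_ns - enter_throttle_ns
    io_service_time = complete_ns - leave_throttle_ns
    return io_wait_time, io_service_time


wait, service = io_delays(1_000, 6_000, 8_500)
print(wait, service)  # 5000 2500
```

A large io_wait_time with a small io_service_time points at the throttle limits rather than the device as the source of latency.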
-
Authored by Xiaoguang Wang
The await value reported by the iostat tool is not good enough: it is somewhat sketchy and cannot show a request's latency on the device driver's side.

Here we add a new counter to track an io request's d2c time; with this patch, iostat can also be easily extended to show this value.

Note: I have checked how iostat is implemented. It just reads the fields it needs, so iostat won't be affected by this change, and neither will tsar.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Xiaoguang Wang
When jbd2 tries to get write access to a buffer that is under writeback with the BH_Shadow flag set, it waits until the buffer has been written to disk. This wait can sometimes be very long, especially when the disk is almost full. Add a proc entry, "force_copy": if its value is non-zero, jbd2 always does a meta buffer copy-out, which eliminates the unnecessary waiting time and reduces long-tail latency for buffered writes. I constructed the following test case:

$cat offline.fio
; fio-rand-RW.job for fiotest
[global]
name=fio-rand-RW
filename=fio-rand-RW
rw=randrw
rwmixread=60
rwmixwrite=40
bs=4K
direct=0
numjobs=4
time_based=1
runtime=900
[file1]
size=60G
ioengine=sync
iodepth=16

$cat online.fio
; fio-seq-write.job for fiotest
[global]
name=fio-seq-write
filename=fio-seq-write
rw=write
bs=256K
direct=0
numjobs=1
time_based=1
runtime=60
[file1]
rate=50m
size=10G
ioengine=sync
iodepth=16

With this patch and force_copy set to 0, the online fio job almost always shows long-tail latency like this:

$cat /proc/fs/jbd2/sda5-8/force_copy
0
Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=17855: Thu Nov 15 09:45:57 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=135, max=4086.6k, avg=867.21, stdev=50338.22
     lat (usec): min=139, max=4086.6k, avg=871.16, stdev=50338.22
    clat percentiles (usec):
     | 1.00th=[ 141], 5.00th=[ 143], 10.00th=[ 145],
     | 20.00th=[ 147], 30.00th=[ 147], 40.00th=[ 149],
     | 50.00th=[ 149], 60.00th=[ 151], 70.00th=[ 153],
     | 80.00th=[ 155], 90.00th=[ 159], 95.00th=[ 163],
     | 99.00th=[ 255], 99.50th=[ 273], 99.90th=[ 429],
     | 99.95th=[ 441], 99.99th=[3640656]

With force_copy set to 1, the online fio latency is much better:

$cat /proc/fs/jbd2/sda5-8/force_copy
1
Jobs: 1 (f=1), 0B/s-0B/s: [W(1)][100.0%][w=50.0MiB/s][w=200 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=8084: Thu Nov 15 09:31:15 2018
  write: IOPS=200, BW=50.0MiB/s (52.4MB/s)(3000MiB/60001msec)
    clat (usec): min=137, max=545, avg=151.35, stdev=16.22
     lat (usec): min=140, max=548, avg=155.31, stdev=16.65
    clat percentiles (usec):
     | 1.00th=[ 143], 5.00th=[ 145], 10.00th=[ 145], 20.00th=[ 147],
     | 30.00th=[ 147], 40.00th=[ 147], 50.00th=[ 149], 60.00th=[ 149],
     | 70.00th=[ 151], 80.00th=[ 155], 90.00th=[ 157], 95.00th=[ 161],
     | 99.00th=[ 239], 99.50th=[ 269], 99.90th=[ 420], 99.95th=[ 429],
     | 99.99th=[ 537]

As for the cost: because the meta buffer is always copied, this consumes a small amount of CPU time and some memory (at most 32MB for a 128MB journal). Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
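The trade-off described in this commit can be pictured with a toy model: if the commit path always copies a metadata buffer out before writing it, a later writer never has to wait for the shadowed original. The sketch below is a simplified illustration only. The structure and function names are hypothetical, and the real jbd2 code has further conditions for copy-out that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one metadata buffer during journal commit. */
struct meta_buffer_model {
    bool under_writeback;   /* stands in for the BH_Shadow state */
    bool copied_out;        /* commit wrote a private copy instead */
};

/* Commit side: start writing the buffer; copy it out first if forced. */
void start_writeback(struct meta_buffer_model *b, int force_copy)
{
    b->under_writeback = true;
    b->copied_out = (force_copy != 0);
}

/* Writer side: waiting is needed only for a shadowed, uncopied buffer. */
bool writer_must_wait(const struct meta_buffer_model *b)
{
    return b->under_writeback && !b->copied_out;
}
```

The cost side of the trade-off is visible in the model as well: with force_copy set, every buffer that enters writeback carries an extra copy.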
-
Committed by zhangliguang
This is a temporary workaround to avoid the limitation on creating hard links across two projids. Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Joseph Qi
Do the jbd2 checkpoint in a dedicated kernel thread, so that checkpointing is not subject to io throttle control. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com> Reviewed-by: Baoyou Xie <baoyou.xie@linux.alibaba.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Dan Williams
commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf upstream. Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware Interface Table), is the first known instance of a memory range described by a unique "target" proximity domain. "Initiator" and "target" proximity domains are an approach that the ACPI HMAT (Heterogeneous Memory Attributes Table) uses to describe the unique performance properties of a memory range relative to a given initiator (e.g. CPU or DMA device). Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y char-device follows the traditional notion of 'numa-node' where the attribute conveys the closest online numa-node. That numa-node attribute is useful for cpu-binding and memory-binding processes *near* the device. However, when the memory range backing a 'pmem' or 'dax' device is onlined (memory hot-add), the memory-only-numa-node representing that address needs to be differentiated from the set of online nodes. In other words, the numa-node association of the device depends on whether you can bind processes *near* the cpu-numa-node in the offline device-case, or bind processes *on* the memory-range directly after the backing address range is onlined. Allow for the case that platform firmware describes persistent memory with a unique proximity domain, i.e. when it is distinct from the proximity of DRAM and CPUs that are on the same socket. Plumb the Linux numa-node translation of that proximity through the libnvdimm region device to namespaces that are in device-dax mode. With this in place the proposed kmem driver [1] can optionally discover a unique numa-node number for the address range as it transitions the memory from an offline state managed by a device-driver to an online memory range managed by the core-mm.
[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
Reported-by: Fan Du <fan.du@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [yshi: Removed PowerPC stuff which is not applicable to 4.19] Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
-
Committed by Srinivas Pandruvada
commit e765f37b9b8b4fa65682e9a78a2ca2b11d3d9096 upstream. When using new non-architectural features through the PUNIT mailbox and MMIO read/write interface, there is still a need to use MSRs to control the PUNIT. User space could have used the user-space MSR interface for this, but it cannot when user-space MSR access is disabled. Only a limited number of MSRs are allowed through this new interface. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Srinivas Pandruvada
commit 31a166fe9c269af17977e650846ee4ea50361c07 upstream. Add an IOCTL to send mailbox commands to the PUNIT using the PUNIT PCI device. A limited set of mailbox commands can be sent to the PUNIT. This interface is used by the intel-speed-select tool under tools/x86/power to enumerate and control Intel Speed Select features. The MBOX command ids and the semantics of the messages can be checked from the source code of the tool. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Srinivas Pandruvada
commit d3a23584294c1f379239a3b52bac13e03fecd147 upstream. Add an MMIO interface to read/write specific offsets in the PUNIT PCI device that export core prioritization. This MMIO interface can be used through the ioctl interface on /dev/isst_interface with IOCTL ISST_IF_IO_CMD. It is used by the intel-speed-select tool under tools/x86/power to enumerate and set core priority. The MMIO offsets and the semantics of the messages can be checked from the source code of the tool. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Srinivas Pandruvada
commit fb5b36a413b9f30fba573fc2a596ab7142dfaf12 upstream. Add processing for IOCTL command ISST_IF_GET_PHY_ID. This converts from the Linux logical CPU number to the PUNIT CPU numbering scheme. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Srinivas Pandruvada
commit 35f2c14d2a076b063a76c5bf275c46c0743ba3a0 upstream. Encapsulate common functions which all Intel Speed Select Technology interface drivers can use. This creates an API to register a misc device for user-kernel communication and to handle all common IOCTLs. As part of the registration, it allows a callback to handle domain-specific ioctl processing. Multiple drivers, which can be built as modules, may register for services, so this driver handles contention during registration as well as during removal. Once user space has opened the misc device, the registered driver is prevented from removal. Also, once the misc device is opened by user space, a new client driver cannot register until the misc device is closed. There are two types of client drivers: one handles the mailbox interface and the other allows direct read/write to some specific MMIO space. This common driver implements IOCTL ISST_IF_GET_PLATFORM_INFO. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Tony Luck
commit 4cf841e398503990df640f7a7c5b2ea56f11c08c upstream. Some new Intel servers provide an interface so that the OS can ask the BIOS to translate a system physical address to a memory address (socket, memory controller, channel, rank, dimm, etc.). This is useful for EDAC drivers that want to take the address of an error reported in a machine check bank and let the user know which DIMM may need to be replaced. Specification for this interface is available at: https://cdrdv2.intel.com/v1/dl/getContent/603354 [ Based on earlier code by Qiuxu Zhuo <qiuxu.zhuo@intel.com>. ] [ bp: Make the first pr_info() in adxl_init() pr_debug() so that it doesn't pollute every dmesg. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> CC: Len Brown <lenb@kernel.org> CC: linux-acpi@vger.kernel.org CC: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/20181015202620.23610-1-tony.luck@intel.com Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Zhang Rui
commit 0c2ddedd8bcb88c4100acb9e0fc5ac8752d09501 upstream. The RAPL MSR interface supports 2 power limits for the package domain and 1 power limit for other domains, while the RAPL MMIO interface supports 2 power limits for both the package and dram domains. When 2 power limits are supported, the FW_LOCK bit is bit 63 of the register instead of bit 31. Remove the assumption that only the package domain supports 2 power limits, and allow the RAPL interface driver to specify the number of power limits supported for every single RAPL domain it owns. Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
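The lock-bit rule stated in this commit can be sketched as a pair of helpers: the FW_LOCK bit position depends on how many power limits the domain supports. The function names below are hypothetical and this is not the driver's actual code; only the bit positions (63 for two limits, 31 for one) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position of FW_LOCK for a domain with the given number of power limits. */
unsigned int fw_lock_bit(unsigned int nr_power_limits)
{
    assert(nr_power_limits == 1 || nr_power_limits == 2);
    return nr_power_limits == 2 ? 63 : 31;
}

/* True if the power limit register value has its FW_LOCK bit set. */
bool domain_is_locked(uint64_t power_limit_reg, unsigned int nr_power_limits)
{
    return (power_limit_reg >> fw_lock_bit(nr_power_limits)) & 1;
}
```

Keying the bit position on a per-domain limit count, rather than on the domain type, is what lets a dram domain with two limits be handled the same way as a package domain.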
-
Committed by Zhang Rui
commit d978e755aabe215cb67bf713e103ed3916ec306d upstream. The RAPL MMIO interface uses 64 bit registers, so force the use of 64 bit registers in all the RAPL code. Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Zhang Rui
commit 3382388d714891fc0f575926189f33d22e7c960b upstream. Split intel_rapl.c into intel_rapl_common.c and intel_rapl_msr.c. intel_rapl_common.c contains the common code that can be used by both the MSR and MMIO interfaces, and intel_rapl_msr.c contains the implementation of the RAPL MSR interface. Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Zhang Rui
commit beea8df821d928e7755917da6c1e45d6afde5148 upstream. The MSR and MMIO RAPL interfaces access registers in different ways. To abstract the register access operations, two callbacks, .read_raw() and .write_raw(), are introduced; they should be implemented by the MSR RAPL and MMIO RAPL interface drivers respectively. This patch implements them for the MSR I/F only. Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
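The callback abstraction described here follows a common C pattern: the common code calls through a table of function pointers, and each interface driver supplies its own implementation. The sketch below is a simplified user-space illustration. The struct name rapl_if_ops, the callback signatures and the fake register array are assumptions made for the example and differ from the kernel's real definitions; only the callback names read_raw and write_raw come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_REG_COUNT 4

/* Hypothetical, simplified callback table; the kernel's real signatures differ. */
struct rapl_if_ops {
    int (*read_raw)(unsigned int reg, uint64_t *val);
    int (*write_raw)(unsigned int reg, uint64_t val);
};

/* Fake register file standing in for MSR access in this sketch. */
static uint64_t fake_msr[FAKE_REG_COUNT];

int fake_msr_read_raw(unsigned int reg, uint64_t *val)
{
    if (reg >= FAKE_REG_COUNT)
        return -1;
    *val = fake_msr[reg];
    return 0;
}

int fake_msr_write_raw(unsigned int reg, uint64_t val)
{
    if (reg >= FAKE_REG_COUNT)
        return -1;
    fake_msr[reg] = val;
    return 0;
}

/* One backend; an MMIO backend would fill in the same table differently. */
const struct rapl_if_ops fake_msr_ops = {
    .read_raw = fake_msr_read_raw,
    .write_raw = fake_msr_write_raw,
};

/* Common code: read-modify-write through whichever backend was supplied. */
int rapl_set_bits(const struct rapl_if_ops *ops, unsigned int reg, uint64_t bits)
{
    uint64_t val;

    assert(ops && ops->read_raw && ops->write_raw);
    if (ops->read_raw(reg, &val))
        return -1;
    return ops->write_raw(reg, val | bits);
}
```

Because rapl_set_bits only sees the table, the same common code would work unchanged with a second backend that reads and writes memory-mapped registers.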
-
Committed by Zhang Rui
commit 7fde2712a7adab721eaabafbd8ff93dff3262d35 upstream. The MSR and MMIO RAPL interfaces have different sets of registers, so the RAPL register addresses should be obtained from an interface-specific structure, i.e. struct rapl_if_private, instead. Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Zhang Rui
commit 7ebf8eff63b4f349e7b2ded6aa5036d94bdf94b9 upstream. Introduce a new structure, rapl_if_private, to save the private data for the different RAPL interfaces. Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Zhang Rui
commit ff956826a403f5cf189978d5ff6b3eb53aa11610 upstream. Create a new header file for the common definitions that might be used by the different RAPL interfaces. Reviewed-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-