- 15 1月, 2016 10 次提交
-
-
由 Yaowei Bai 提交于
Make memblock_is_memory() and memblock_is_reserved return bool to improve readability due to these particular functions only using either one or zero as their return value. No functional change. Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
Make is_file_hugepages() return bool to improve readability due to this particular function only using either one or zero as its return value. This patch also removed the if condition to make is_file_hugepages return directly. No functional change. Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nathan Zimmer 提交于
When running the SPECint_rate gcc on some very large boxes it was noticed that the system was spending lots of time in mpol_shared_policy_lookup(). The gamess benchmark can also show it and is what I mostly used to chase down the issue since the setup for that I found to be easier. To be clear the binaries were on tmpfs because of disk I/O requirements. We then used text replication to avoid icache misses and having all the copies banging on the memory where the instruction code resides. This results in us hitting a bottleneck in mpol_shared_policy_lookup() since lookup is serialised by the shared_policy lock. I have only reproduced this on very large (3k+ cores) boxes. The problem starts showing up at just a few hundred ranks getting worse until it threatens to livelock once it gets large enough. For example on the gamess benchmark at 128 ranks this area consumes only ~1% of time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over 90%. To alleviate the contention in this area I converted the spinlock to an rwlock. This allows a large number of lookups to happen simultaneously. The results were quite good reducing this consumtion at max ranks to around 2%. [akpm@linux-foundation.org: tidy up code comments] Signed-off-by: NNathan Zimmer <nzimmer@sgi.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chen Gang 提交于
__phys_to_pfn and __pfn_to_phys are symmetric, PHYS_PFN and PFN_PHYS are semmetric: - y = (phys_addr_t)x << PAGE_SHIFT - y >> PAGE_SHIFT = (phys_add_t)x - (unsigned long)(y >> PAGE_SHIFT) = x [akpm@linux-foundation.org: use macro arg name `x'] [arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition] Signed-off-by: NChen Gang <gang.chen.5i5j@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Currently, if we want to account all objects of a particular kmem cache, we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed to kmem_cache_create will force accounting for every allocation from this cache even if __GFP_ACCOUNT is not passed. This patch does not make any of the existing caches use this flag - it will be done later in the series. Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and hence cannot have different sets of SLAB_* flags. Thus using this flag will probably reduce the number of merged slabs even if kmem accounting is not used (only compiled in). Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Suggested-by: NTejun Heo <tj@kernel.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So this patch switches kmem accounting to the white-policy: now only those kmem allocations that are marked as __GFP_ACCOUNT are accounted to memcg. Currently, no kmem allocations are marked like this. The following patches will mark several kmem allocations that are known to be easily triggered from userspace and therefore should be accounted to memcg. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
This reverts commit 8f4fc071 ("gfp: add __GFP_NOACCOUNT"). Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So it was decided to switch to the white-list policy. This patch reverts bits introducing the black-list policy. The white-list policy will be introduced later in the series. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
A little cleanup - the invocation site provdes the semicolon. Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jeff Layton 提交于
At the time that this code was originally written, call_srcu didn't exist, so this thread was required to ensure that we waited for that SRCU grace period to settle before finally freeing the object. It does exist now however and we can much more efficiently use call_srcu to handle this. That also allows us to potentially use srcu_barrier to ensure that they are all of the callbacks have run before proceeding. In order to conserve space, we union the rcu_head with the g_list. This will be necessary for nfsd which will allocate marks from a dedicated slabcache. We have to be able to ensure that all of the objects are destroyed before destroying the cache. That's fairly Signed-off-by: NJeff Layton <jeff.layton@primarydata.com> Cc: Eric Paris <eparis@parisplace.org> Reviewed-by: NJan Kara <jack@suse.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 1月, 2016 7 次提交
-
-
由 Tom Herbert 提交于
Obviously need to 'or in NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM. Fixes: c8cd0989 ("net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM") Reported-by: NJack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maor Gottlieb 提交于
Adding flow steering support by creating a flow-table per priority (if rules exist in the priority). mlx5_ib uses autogrouping and thus only creates the required destinations. Also includes adding of these flow steering utilities 1. Parsing verbs flow attributes hardware steering specs. 2. Check if flow is multicast - this is required in order to decide to which flow table will we add the steering rule. 3. Set outer headers in flow match criteria to zeros. Signed-off-by: NMaor Gottlieb <maorg@mellanox.com> Signed-off-by: NMoni Shoua <monis@mellanox.com> Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maor Gottlieb 提交于
Change the mlx5 firmware interface header to make it more clear which bytes should be used by IPv4 or IPv6 addresses. Signed-off-by: NMaor Gottlieb <maorg@mellanox.com> Signed-off-by: NMoni Shoua <monis@mellanox.com> Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maor Gottlieb 提交于
When the driver is loaded, we create flow steering namespace for kernel bypass with nine priorities and another namespace for leftovers(in order to catch packets that weren't matched). Verbs applications will use these priorities. we found nine as a number that balances the requirements from the user and retains performance. The bypass namespace is used by verbs applications that want to bypass the kernel networking stack. The leftovers namespace is used by verbs applications and the sniffer in order to catch packets that weren't handled by any preceding rules. Signed-off-by: NMaor Gottlieb <maorg@mellanox.com> Signed-off-by: NMoni Shoua <monis@mellanox.com> Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maor Gottlieb 提交于
Introduce the modify flow table command. This command is used when we want to change the next flow table of an existing flow table. The next flow table is defined as the table we search (in order to find a match), if we couldn't find a match in any of the flow table entries in the current flow table. Signed-off-by: NMaor Gottlieb <maorg@mellanox.com> Signed-off-by: NMoni Shoua <monis@mellanox.com> Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maor Gottlieb 提交于
The root Flow Table for each Flow Table Type is defined, by default, as the Flow Table with level 0. In order not to use an empty flow tables and introduce new hops, but still preserve space for flow-tables that have a priority greater(lower number) than the current flow table, we introduce this new set root flow table command. This command tells the HW to start matching packets from the assigned root flow table. This command is used when we create new flow table with level lower than the current lowest flow table or it is the first flow table. Signed-off-by: NMaor Gottlieb <maorg@mellanox.com> Signed-off-by: NMoni Shoua <monis@mellanox.com> Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maor Gottlieb 提交于
When user add rule to autogrouped flow table, we search for flow group with the same match criteria, if we don't find such group then we create new flow group with the required match criteria and insert the rule to this group. We divide the flow table into required_groups + 1, in order to reserve a part of the flow table for rules which don't match any existing group. Signed-off-by: NMaor Gottlieb <maorg@mellanox.com> Signed-off-by: NMoni Shoua <monis@mellanox.com> Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 1月, 2016 3 次提交
-
-
由 willy tarreau 提交于
It is possible for a process to allocate and accumulate far more FDs than the process' limit by sending them over a unix socket then closing them to keep the process' fd count low. This change addresses this problem by keeping track of the number of FDs in flight per user and preventing non-privileged processes from having more FDs in flight than their configured FD limit. Reported-by: socketpair@gmail.com Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NWilly Tarreau <w@1wt.eu> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
This work adds a generalization of the ingress qdisc as a qdisc holding only classifiers. The clsact qdisc works on ingress, but also on egress. In both cases, it's execution happens without taking the qdisc lock, and the main difference for the egress part compared to prior version of [1] is that this can be applied with _any_ underlying real egress qdisc (also classless ones). Besides solving the use-case of [1], that is, allowing for more programmability on assigning skb->priority for the mqprio case that is supported by most popular 10G+ NICs, it also opens up a lot more flexibility for other tc applications. The main work on classification can already be done at clsact egress time if the use-case allows and state stored for later retrieval f.e. again in skb->priority with major/minors (which is checked by most classful qdiscs before consulting tc_classify()) and/or in other skb fields like skb->tc_index for some light-weight post-processing to get to the eventual classid in case of a classful qdisc. Another use case is that the clsact egress part allows to have a central egress counterpart to the ingress classifiers, so that classifiers can easily share state (e.g. in cls_bpf via eBPF maps) for ingress and egress. Currently, default setups like mq + pfifo_fast would require for this to use, for example, prio qdisc instead (to get a tc_classify() run) and to duplicate the egress classifier for each queue. With clsact, it allows for leaving the setup as is, it can additionally assign skb->priority to put the skb in one of pfifo_fast's bands and it can share state with maps. Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid) w/o the need to perform a skb_dst_force() to hold on to it any longer. In lwt case, we can also use this facility to setup dst metadata via cls_bpf (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for that (case of IFF_NO_QUEUE devices, for example). The realization can be done without any changes to the scheduler core framework. All it takes is that we have two a-priori defined minors/child classes, where we can mux between ingress and egress classifier list (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to dev->_tx to avoid extra cacheline miss for moderate loads). The egress part is a bit similar modelled to handle_ing() and patched to a noop in case the functionality is not used. Both handlers are now called sch_handle_ingress() and sch_handle_egress(), code sharing among the two doesn't seem practical as there are various minor differences in both paths, so that making them conditional in a single handler would rather slow things down. Full compatibility to ingress qdisc is provided as well. Since both piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist per netdevice, and thus ingress qdisc specific behaviour can be retained for user space. This means, either a user does 'tc qdisc add dev foo ingress' and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact' alternative, where both, ingress and egress classifier can be configured as in the below example. ingress qdisc supports attaching classifier to any minor number whereas clsact has two fixed minors for muxing between the lists, therefore to not break user space setups, they are better done as two separate qdiscs. I decided to extend the sch_ingress module with clsact functionality so that commonly used code can be reused, the module is being aliased with sch_clsact so that it can be auto-loaded properly. Alternative would have been to add a flag when initializing ingress to alter its behaviour plus aliasing to a different name (as it's more than just ingress). However, the first would end up, based on the flag, choosing the new/old behaviour by calling different function implementations to handle each anyway, the latter would require to register ingress qdisc once again under different alias. So, this really begs to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops by its own that share callbacks used by both. Example, adding qdisc: # tc qdisc add dev foo clsact # tc qdisc show dev foo qdisc mq 0: root qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc clsact ffff: parent ffff:fff1 Adding filters (deleting, etc works analogous by specifying ingress/egress): # tc filter add dev foo ingress bpf da obj bar.o sec ingress # tc filter add dev foo egress bpf da obj bar.o sec egress # tc filter show dev foo ingress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action # tc filter show dev foo egress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will show an empty list for clsact. Either using the parent names (ingress/egress) or specifying the full major/minor will then show the related filter lists. Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend. [1] http://patchwork.ozlabs.org/patch/512949/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NJohn Fastabend <john.r.fastabend@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Add a small helper skb_postpush_rcsum() and fix up redirect locations that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects a proper csum that covers also Ethernet header, f.e. since 2c26d34b ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we also do skb_postpull_rcsum() after pulling Ethernet header off via eth_type_trans(). When using eBPF in a netns setup f.e. with vxlan in collect metadata mode, I can trigger the following csum issue with an IPv6 setup: [ 505.144065] dummy1: hw csum failure [...] [ 505.144108] Call Trace: [ 505.144112] <IRQ> [<ffffffff81372f08>] dump_stack+0x44/0x5c [ 505.144134] [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40 [ 505.144142] [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0 [ 505.144149] [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120 [ 505.144161] [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6] [ 505.144170] [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables] [ 505.144177] [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6] [ 505.144189] [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack] [ 505.144196] [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6] [ 505.144204] [<ffffffff8164385d>] nf_iterate+0x5d/0x70 [ 505.144210] [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0 [ 505.144218] [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0 [ 505.144225] [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0 [ 505.144232] [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0 [ 505.144239] [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60 [ 505.144245] [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60 [ 505.144252] [<ffffffff8160ccff>] process_backlog+0x9f/0x140 [ 505.144259] [<ffffffff8160c4a5>] net_rx_action+0x145/0x320 [...] What happens is that on ingress, we push Ethernet header back in, either from cls_bpf or right before skb_do_redirect(), but without updating csum. The "hw csum failure" can be fixed by using the new skb_postpush_rcsum() helper for the dev_forward_skb() case to correct the csum diff again. Thanks to Hannes Frederic Sowa for the csum_partial() idea! Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper") Fixes: 27b29f63 ("bpf: add bpf_redirect() helper") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 1月, 2016 5 次提交
-
-
由 Dan Williams 提交于
These actions are completely managed by a block driver or can use the badblocks api directly. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
Provide a devres interface for initializing a badblocks instance. The pmem driver has several scenarios where it will be beneficial to have this structure automatically freed when the device is disabled / fails probe. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
For symmetry with badblocks_init() make it clear that this path only destroys incremental allocations of a badblocks instance, and does not free the badblocks instance itself. Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Vishal Verma 提交于
During region creation, perform Address Range Scrubs (ARS) for the SPA (System Physical Address) ranges to retrieve known poison locations from firmware. Add a new data structure 'nd_poison' which is used as a list in nvdimm_bus to store these poison locations. When creating a pmem namespace, if there is any known poison associated with its physical address space, convert the poison ranges to bad sectors that are exposed using the badblocks interface. Signed-off-by: NVishal Verma <vishal.l.verma@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Vishal Verma 提交于
NVDIMM devices, which can behave more like DRAM rather than block devices, may develop bad cache lines, or 'poison'. A block device exposed by the pmem driver can then consume poison via a read (or write), and cause a machine check. On platforms without machine check recovery features, this would mean a crash. The block device maintaining a runtime list of all known sectors that have poison can directly avoid this, and also provide a path forward to enable proper handling/recovery for DAX faults on such a device. Use the new badblock management interfaces to add a badblocks list to gendisks. Signed-off-by: NVishal Verma <vishal.l.verma@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 09 1月, 2016 11 次提交
-
-
由 Vishal Verma 提交于
Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data. The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions. Signed-off-by: NVishal Verma <vishal.l.verma@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Dan Williams 提交于
If an application wants exclusive access to all of the persistent memory provided by an NVDIMM namespace it can use this raw-block-dax facility to forgo establishing a filesystem. This capability is targeted primarily to hypervisors wanting to provision persistent memory for guests. It can be disabled / enabled dynamically via the new BLKDAXSET ioctl. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Reviewed-by: NJan Kara <jack@suse.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 NeilBrown 提交于
We need information about exports when crossing mountpoints during lookup or NFSv4 readdir. If we don't already have that information cached, we may have to ask (and wait for) rpc.mountd. In both cases we currently hold the i_mutex on the parent of the directory we're asking rpc.mountd about. We've seen situations where rpc.mountd performs some operation on that directory that tries to take the i_mutex again, resulting in deadlock. With some care, we may be able to avoid that in rpc.mountd. But it seems better just to avoid holding a mutex while waiting on userspace. It appears that lookup_one_len is pretty much the only operation that needs the i_mutex. So we could just drop the i_mutex elsewhere and do something like mutex_lock() lookup_one_len() mutex_unlock() In many cases though the lookup would have been cached and not required the i_mutex, so it's more efficient to create a lookup_one_len() variant that only takes the i_mutex when necessary. Signed-off-by: NNeilBrown <neilb@suse.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Daniel Borkmann 提交于
Move the details behind the cb[] access into a small helper to decouple and make them generic for bpf_prog_run_save_cb()/bpf_prog_run_clear_cb() that was introduced via commit ff936a04 ("bpf: fix cb access in socket filter programs"). Also add a comment to better clarify what is done in bpf_skb_cb(). Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Paolo Bonzini 提交于
Since the numbers now overlap, it makes sense to enumerate them in asm/kvm_host.h rather than linux/kvm_host.h. Functions that refer to architecture-specific requests are also moved to arch/. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Leave room for 4 more arch-independent requests. Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Suggested-by: NTakuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> [Takuya moved all subsequent constants to fill the void, but that is useless in view of the following patches. So this change looks nothing like the original. - Paolo] Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Danesh Petigara 提交于
The link resume logic uses a 200msec delay while debouncing the SControl register. The rationale behind that delay is to accommodate some PHYs that behave badly if their SStatus/ SControl registers are pounded immediately on resume. The Broadcom STB SATA PHY does not seem to have this issue. This patch introduces a new link flag that allows platforms to skip the debounce delay if it isn't needed. Signed-off-by: NDanesh Petigara <dpetigara@broadcom.com> Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Jeff Layton 提交于
...so we can print information about it if there are leaked locks. Signed-off-by: NJeff Layton <jeff.layton@primarydata.com> Acked-by: N"J. Bruce Fields" <bfields@fieldses.org>
-
- 08 1月, 2016 4 次提交
-
-
由 Steven Rostedt (Red Hat) 提交于
Qiu Peiyang pointed out that there's a race when enabling function tracing and loading a module. In order to make the modifications of converting nops in the prologue of functions into callbacks, the text needs to be converted from read-only to read-write. When enabling function tracing, the text permission is updated, the functions are modified, and then they are put back. When loading a module, the updates to convert function calls to mcount is done before the module text is set to read-only. But after it is done, the module text is visible by the function tracer. Thus we have the following race: CPU 0 CPU 1 ----- ----- start function tracing set text to read-write load_module add functions to ftrace set module text read-only update all functions to callbacks modify module functions too < Can't it's read-only > When this happens, ftrace detects the issue and disables itself till the next reboot. To fix this, a new DISABLED flag is added for ftrace records, which all module functions get when they are added. Then later, after the module code is all set, the records will have the DISABLED flag cleared, and they will be enabled if any callback wants all functions to be traced. Note, this doesn't add the delay to later. It simply changes the ftrace_module_init() to do both the setting of DISABLED records, and then immediately calls the enable code. This helps with testing this new code as it has the same behavior as previously. Another change will come after this to have the ftrace_module_enable() called after the text is set to read-only. Cc: Qiu Peiyang <peiyangx.qiu@intel.com> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
由 Andrew Lunn 提交于
Make device_free and device_remove operations in the mdio device structure, so the core code does not need to differentiate between phy devices and generic mdio devices. Signed-off-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andrew Lunn 提交于
Not all devices on an MDIO bus are PHYs. Meaning not all MDIO drivers are PHY drivers. Add support for generic MDIO drivers. Signed-off-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andrew Lunn 提交于
Matching a driver to a device has both generic parts, and parts which are specific to PHY devices. Move the PHY specific parts into phy_device. Signed-off-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-