- 26 5月, 2016 1 次提交
-
-
由 Ilya Dryomov 提交于
The size of ->r_request and ->r_reply messages depends on the size of the object name (ceph_object_id), while the size of ceph_osd_request is fixed. Move message allocation into a separate function that would have to be called after ceph_object_id and ceph_object_locator (which is also going to become variable in size with RADOS namespaces) have been filled in: req = ceph_osdc_alloc_request(...); <fill in req->r_base_oid> <fill in req->r_base_oloc> ceph_osdc_alloc_messages(req); Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
- 13 5月, 2016 1 次提交
-
-
由 Andrea Arcangeli 提交于
This will provide fully accuracy to the mapcount calculation in the write protect faults, so page pinning will not get broken by false positive copy-on-writes. total_mapcount() isn't the right calculation needed in reuse_swap_page(), so this introduces a page_trans_huge_mapcount() that is effectively the full accurate return value for page_mapcount() if dealing with Transparent Hugepages, however we only use the page_trans_huge_mapcount() during COW faults where it strictly needed, due to its higher runtime cost. This also provide at practical zero cost the total_mapcount information which is needed to know if we can still relocate the page anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we can reuse the page no matter if it's a pte or a pmd_trans_huge triggering the fault, but we can only relocate the page anon_vma to the local vma->anon_vma if we're sure it's only this "vma" mapping the whole THP physical range. Kirill A. Shutemov discovered the problem with moving the page anon_vma to the local vma->anon_vma in a previous version of this patch and another problem in the way page_move_anon_rmap() was called. Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a previous version, because reuse_swap_page must be a macro to call page_trans_huge_mapcount from swap.h, so this uses a macro again instead of an inline function. With this change at least it's a less dangerous usage than it was before, because "page" is used only once now, while with the previous code reuse_swap_page(page++) would have called page_mapcount on page+1 and it would have increased page twice instead of just once. Dean Luick noticed an uninitialized variable that could result in a rmap inefficiency for the non-THP case in a previous version. Mike Marciniszyn said: : Our RDMA tests are seeing an issue with memory locking that bisects to : commit 61f5d698 ("mm: re-enable THP") : : The test program registers two rather large MRs (512M) and RDMA : writes data to a passive peer using the first and RDMA reads it back : into the second MR and compares that data. The sizes are chosen randomly : between 0 and 1024 bytes. : : The test will get through a few (<= 4 iterations) and then gets a : compare error. : : Tracing indicates the kernel logical addresses associated with the individual : pages at registration ARE correct , the data in the "RDMA read response only" : packets ARE correct. : : The "corruption" occurs when the packet crosse two pages that are not physically : contiguous. The second page reads back as zero in the program. : : It looks like the user VA at the point of the compare error no longer points to : the same physical address as was registered. : : This patch totally resolves the issue! Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Reviewed-by: NDean Luick <dean.luick@intel.com> Tested-by: NAlex Williamson <alex.williamson@redhat.com> Tested-by: NMike Marciniszyn <mike.marciniszyn@intel.com> Tested-by: NJosh Collier <josh.d.collier@intel.com> Cc: Marc Haber <mh+linux-kernel@zugschlus.de> Cc: <stable@vger.kernel.org> [4.5] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 5月, 2016 2 次提交
-
-
由 Miklos Szeredi 提交于
Overlayfs needs lookup without inode_permission() and already has the name hash (in form of dentry->d_name on overlayfs dentry). It also doesn't support filesystems with d_op->d_hash() so basically it only needs the actual hashed lookup from lookup_one_len_unlocked() So add a new helper that does unlocked lookup of a hashed name. Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Miklos Szeredi 提交于
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+
-
- 10 5月, 2016 2 次提交
-
-
由 Josh Poimboeuf 提交于
gcc support for __builtin_bswap16() was supposedly added for powerpc in gcc 4.6, and was then later added for other architectures in gcc 4.8. However, Stephen Rothwell reported that attempting to use it on powerpc in gcc 4.6 fails with: lib/vsprintf.c:160:2: error: initializer element is not constant lib/vsprintf.c:160:2: error: (near initialization for 'decpair[0]') lib/vsprintf.c:160:2: error: initializer element is not constant lib/vsprintf.c:160:2: error: (near initialization for 'decpair[1]') ... I'm not entirely sure what those errors mean, but I don't see them on gcc 4.8. So let's consider gcc 4.8 to be the official starting point for __builtin_bswap16(). Arnd Bergmann adds: "I found the commit in gcc-4.8 that replaced the powerpc-specific implementation of __builtin_bswap16 with an architecture-independent one. Apparently the powerpc version (gcc-4.6 and 4.7) just mapped to the lhbrx/sthbrx instructions, so it ended up not being a constant, though the intent of the patch was mainly to add support for the builtin to x86: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52624 has the patch that went into gcc-4.8 and more information." Fixes: 7322dd75 ("byteswap: try to avoid __builtin_constant_p gcc bug") Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Tested-by: NStephen Rothwell <sfr@canb.auug.org.au> Acked-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Serge E. Hallyn 提交于
Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: NSerge Hallyn <serge.hallyn@ubuntu.com> Tested-by: NMichael Kerrisk <mtk.manpages@gmail.com> Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 07 5月, 2016 1 次提交
-
-
由 Jarno Rajahalme 提交于
UDP tunnel segmentation code relies on the inner offsets being set for an UDP tunnel GSO packet, but the inner *_complete() functions will set the inner offsets only if 'encapsulation' is set before calling them. Currently, udp_gro_complete() sets 'encapsulation' only after the inner *_complete() functions are done. This causes the inner offsets having invalid values after udp_gro_complete() returns, which in turn will make it impossible to properly segment the packet in case it needs to be forwarded, which would be visible to the user either as invalid packets being sent or as packet loss. This patch fixes this by setting skb's 'encapsulation' in udp_gro_complete() before calling into the inner complete functions, and by making each possible UDP tunnel gro_complete() callback set the inner_mac_header to the beginning of the tunnel payload. Signed-off-by: NJarno Rajahalme <jarno@ovn.org> Reviewed-by: NAlexander Duyck <aduyck@mirantis.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 5月, 2016 3 次提交
-
-
由 Andrea Arcangeli 提交于
After the THP refcounting change, obtaining a compound pages from get_user_pages() no longer allows us to assume the entire compound page is immediately mappable from a secondary MMU. A secondary MMU doesn't want to call get_user_pages() more than once for each compound page, in order to know if it can map the whole compound page. So a secondary MMU needs to know from a single get_user_pages() invocation when it can map immediately the entire compound page to avoid a flood of unnecessary secondary MMU faults and spurious atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier users). Ideally instead of the page->_mapcount < 1 check, get_user_pages() should return the granularity of the "page" mapping in the "mm" passed to get_user_pages(). However it's non trivial change to pass the "pmd" status belonging to the "mm" walked by get_user_pages up the stack (up to the caller of get_user_pages). So the fix just checks if there is not a single pte mapping on the page returned by get_user_pages, and in turn if the caller can assume that the whole compound page is mapped in the current "mm" (in a pmd_trans_huge()). In such case the entire compound page is safe to map into the secondary MMU without additional get_user_pages() calls on the surrounding tail/head pages. In addition of being faster, not having to run other get_user_pages() calls also reduces the memory footprint of the secondary MMU fault in case the pmd split happened as result of memory pressure. Without this fix after a MADV_DONTNEED (like invoked by QEMU during postcopy live migration or balloning) or after generic swapping (with a failure in split_huge_page() that would only result in pmd splitting and not a physical page split), KVM would map the whole compound page into the shadow pagetables, despite regular faults or userfaults (like UFFDIO_COPY) may map regular pages into the primary MMU as result of the pte faults, leading to the guest mode and userland mode going out of sync and not working on the same memory at all times. Any other secondary MMU notifier manager (KVM is just one of the many MMU notifier users) will need the same information if it doesn't want to run a flood of get_user_pages_fast and it can support multiple granularity in the secondary MMU mappings, so I think it is justified to be exposed not just to KVM. The other option would be to move transparent_hugepage_adjust to mm/huge_memory.c but that currently has all kind of KVM data structures in it, so it's definitely not a cut-and-paste work, so I couldn't do a fix as cleaner as this one for 4.6. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: "Li, Liang Z" <liang.z.li@intel.com> Cc: Amit Shah <amit.shah@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexandre Bounine 提交于
Fix problems in uapi definitions reported by Gabriel Laskar: (see https://lkml.org/lkml/2016/4/5/205 for details) - move public header file rio_mport_cdev.h to include/uapi/linux directory - change types in data structures passed as IOCTL parameters - improve parameter checking in some IOCTL service routines Signed-off-by: NAlexandre Bounine <alexandre.bounine@idt.com> Reported-by: NGabriel Laskar <gabriel@lse.epita.fr> Tested-by: NBarry Wood <barry.wood@idt.com> Cc: Gabriel Laskar <gabriel@lse.epita.fr> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com> Cc: Barry Wood <barry.wood@idt.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Cgroup2 currently doesn't have a per-cgroup swappiness setting. We might want to add one later - that's a different discussion - but until we do, the cgroups should always follow the system setting. Otherwise it will be unchangeably set to whatever the ancestor inherited from the system setting at the time of cgroup creation. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.5] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 5月, 2016 1 次提交
-
-
由 Alexander Duyck 提交于
We need to perform an additional check on the inner headers to determine if we can offload the checksum for them. Previously this check didn't occur so we would generate an invalid frame in the case of an IPv6 header encapsulated inside of an IPv4 tunnel. To fix this I added a secondary check to vxlan_features_check so that we can verify that we can offload the inner checksum. Signed-off-by: NAlexander Duyck <aduyck@mirantis.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 5月, 2016 1 次提交
-
-
由 Linus Torvalds 提交于
This is a fairly minimal fixup to the horribly bad behavior of hash_64() with certain input patterns. In particular, because the multiplicative value used for the 64-bit hash was intentionally bit-sparse (so that the multiply could be done with shifts and adds on architectures without hardware multipliers), some bits did not get spread out very much. In particular, certain fairly common bit ranges in the input (roughly bits 12-20: commonly with the most information in them when you hash things like byte offsets in files or memory that have block factors that mean that the low bits are often zero) would not necessarily show up much in the result. There's a bigger patch-series brewing to fix up things more completely, but this is the fairly minimal fix for the 64-bit hashing problem. It simply picks a much better constant multiplier, spreading the bits out a lot better. NOTE! For 32-bit architectures, the bad old hash_64() remains the same for now, since 64-bit multiplies are expensive. The bigger hashing cleanup will replace the 32-bit case with something better. The new constants were picked by George Spelvin who wrote that bigger cleanup series. I just picked out the constants and part of the comment from that series. Cc: stable@vger.kernel.org Cc: George Spelvin <linux@horizon.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 5月, 2016 1 次提交
-
-
由 Tim Bingham 提交于
Prior to commit d92cff89 ("net_dbg_ratelimited: turn into no-op when !DEBUG") the implementation of net_dbg_ratelimited() was buggy for both the DEBUG and CONFIG_DYNAMIC_DEBUG cases. The bug was that net_ratelimit() was being called and, despite returning true, nothing was being printed to the console. This resulted in messages like the following - "net_ratelimit: %d callbacks suppressed" with no other output nearby. After commit d92cff89 ("net_dbg_ratelimited: turn into no-op when !DEBUG") the bug is fixed for the DEBUG case. However, there's no output at all for CONFIG_DYNAMIC_DEBUG case. This patch restores debug output (if enabled) for the CONFIG_DYNAMIC_DEBUG case. Add a definition of net_dbg_ratelimited() for the CONFIG_DYNAMIC_DEBUG case. The implementation takes care to check that dynamic debugging is enabled before calling net_ratelimit(). Fixes: d92cff89 ("net_dbg_ratelimited: turn into no-op when !DEBUG") Signed-off-by: NTim Bingham <tbingham@akamai.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 4月, 2016 5 次提交
-
-
由 Gerald Schaefer 提交于
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong because the layouts may differ depending on the architecture. On s390 this will lead to inaccurate numa_maps accounting in /proc because of misguided pte_present() and pte_dirty() checks on the fake pte. On other architectures pte_present() and pte_dirty() may work by chance, but there may be an issue with direct-access (dax) mappings w/o underlying struct pages when HAVE_PTE_SPECIAL is set and THP is available. In vm_normal_page() the fake pte will be checked with pte_special() and because there is no "special" bit in a pmd, this will always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be skipped. On dax mappings w/o struct pages, an invalid struct page pointer would then be returned that can crash the kernel. This patch fixes the numa_maps THP handling by introducing new "_pmd" variants of the can_gather_numa_stats() and vm_normal_page() functions. Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Andrea has found[1] a race condition on MMU-gather based TLB flush vs split_huge_page() or shrinker which frees huge zero under us (patch 1/2 and 2/2 respectively). With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the page pinned until flush is complete and the pin prevents the page from being split under us. We still need patch 2/2. This is simplified version of Andrea's patch. We don't need fancy encoding. [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Steve Capper 提交于
HugeTLB pages cannot be split, so we use the compound_mapcount to track rmaps. Currently page_mapped() will check the compound_mapcount, but will also go through the constituent pages of a THP compound page and query the individual _mapcount's too. Unfortunately, page_mapped() does not distinguish between HugeTLB and THP compound pages and assumes that a compound page always needs to have HPAGE_PMD_NR pages querying. For most cases when dealing with HugeTLB this is just inefficient, but for scenarios where the HugeTLB page size is less than the pmd block size (e.g. when using contiguous bit on ARM) this can lead to crashes. This patch adjusts the page_mapped function such that we skip the unnecessary THP reference checks for HugeTLB pages. Fixes: e1534ae9 ("mm: differentiate page_mapped() from page_mapcount() for compound pages") Signed-off-by: NSteve Capper <steve.capper@arm.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexei Starovoitov 提交于
On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK, the malicious application may overflow 32-bit bpf program refcnt. It's also possible to overflow map refcnt on 1Tb system. Impose 32k hard limit which means that the same bpf program or map cannot be shared by more than 32k processes. Fixes: 1be7f75d ("bpf: enable non-root eBPF programs") Reported-by: NJann Horn <jannh@google.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Marcelo Ricardo Leitner 提交于
Fix casting in net_gso_ok. Otherwise the shift on gso_type << NETIF_F_GSO_SHIFT may hit the 32th bit and make it look like a INT_MIN, which is then promoted from signed to uint64 which is 0xffffffff80000000, resulting in wrong behavior when it is and'ed with the feature itself, as in: This test app: #include <stdio.h> #include <stdint.h> int main(int argc, char **argv) { uint64_t feature1; uint64_t feature2; int gso_type = 1 << 15; feature1 = gso_type << 16; feature2 = (uint64_t)gso_type << 16; printf("%lx %lx\n", feature1, feature2); return 0; } Gives: ffffffff80000000 80000000 So that this: return (features & feature) == feature; Actually works on more bits than expected and invalid ones. Fix is to promote it earlier. Issue noted while rebasing SCTP GSO patch but posting separetely as someone else may experience this meanwhile. Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 4月, 2016 2 次提交
-
-
由 Sagi Grimberg 提交于
mlx5 devices (Connect-IB, ConnectX-4, ConnectX-4-LX) has a limitation where rdma read work queue entries cannot exceed 512 bytes. A rdma_read wqe needs to fit in 512 bytes: - wqe control segment (16 bytes) - rdma segment (16 bytes) - scatter elements (16 bytes each) So max_sge_rd should be: (512 - 16 - 16) / 16 = 30. Cc: linux-stable@vger.kernel.org Reported-by: NChristoph Hellwig <hch@lst.de> Tested-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagig@grimberg.me> Signed-off-by: NLeon Romanovsky <leonro@mellanox.com> Signed-off-by: NDoug Ledford <dledford@redhat.com>
-
由 Heikki Krogerus 提交于
Since fwnode may hold ERR_PTR(-ENODEV) or it may be NULL, the fwnode type checks is_of_node(), is_acpi_node() and is is_pset_node() need to consider it. Using IS_ERR_OR_NULL() to check it. Fixes: 0d67e0fa (device property: fix for a case of use-after-free) Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NHeikki Krogerus <heikki.krogerus@linux.intel.com> [ rjw: Subject & changelog ] Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 27 4月, 2016 1 次提交
-
-
由 Linus Torvalds 提交于
This is more prep-work for the upcoming pty changes. Still just code cleanup with no actual semantic changes. This removes a bunch pointless complexity by just having the slave pty side remember the dentry associated with the devpts slave rather than the inode. That allows us to remove all the "look up the dentry" code for when we want to remove it again. Together with moving the tty pointer from "inode->i_private" to "dentry->d_fsdata" and getting rid of pointless inode locking, this removes about 30 lines of code. Not only is the end result smaller, it's simpler and easier to understand. The old code, for example, depended on the d_find_alias() to not just find the dentry, but also to check that it is still hashed, which in turn validated the tty pointer in the inode. That is a _very_ roundabout way to say "invalidate the cached tty pointer when the dentry is removed". The new code just does dentry->d_fsdata = NULL; in devpts_pty_kill() instead, invalidating the tty pointer rather more directly and obviously. Don't do something complex and subtle when the obvious straightforward approach will do. The rest of the patch (ie apart from code deletion and the above tty pointer clearing) is just switching the calling convention to pass the dentry or file pointer around instead of the inode. Cc: Eric Biederman <ebiederm@xmission.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Greg KH <greg@kroah.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 4月, 2016 4 次提交
-
-
由 David S. Miller 提交于
This reverts commit 841645b5. Ok, this puts the feature back. I've decided to apply David A.'s bug fix and run with that rather than make everyone wait another whole release for this feature. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Tejun Heo 提交于
Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset kicks off asynchronous NUMA node migration if necessary during task migration and flushes it from cpuset_post_attach_flush() which is called at the end of __cgroup_procs_write(). This is to avoid performing migration with cgroup_threadgroup_rwsem write-locked which can lead to deadlock through dependency on kworker creation. memcg has a similar issue with charge moving, so let's convert it to an official callback rather than the current one-off cpuset specific function. This patch adds cgroup_subsys->post_attach callback and makes cpuset register cpuset_post_attach_flush() as its ->post_attach. The conversion is mostly one-to-one except that the new callback is called under cgroup_mutex. This is to guarantee that no other migration operations are started before ->post_attach callbacks are finished. cgroup_mutex is one of the outermost mutex in the system and has never been and shouldn't be a problem. We can add specialized synchronization around __cgroup_procs_write() but I don't think there's any noticeable benefit. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
-
由 David S. Miller 提交于
This reverts the following three commits: 70af921d 799977d9 f1705ec1 The feature was ill conceived, has terrible semantics, and has added nothing but regressions to the already fragile ipv6 stack. Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ilya Dryomov 提交于
Starting the kernel client with cephx disabled and then enabling cephx and restarting userspace daemons can result in a crash: [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000 [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130 [262671.584334] PGD 0 [262671.635847] Oops: 0000 [#1] SMP [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014 [262672.268937] Workqueue: ceph-msgr con_work [libceph] [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000 [262672.428330] RIP: 0010:[<ffffffff811cd04a>] [<ffffffff811cd04a>] kfree+0x5a/0x130 [262672.535880] RSP: 0018:ffff880149aeba58 EFLAGS: 00010286 [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018 [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012 [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000 [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40 [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840 [262673.131722] FS: 0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000 [262673.245377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0 [262673.417556] Stack: [262673.472943] ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04 [262673.583767] ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00 [262673.694546] ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800 [262673.805230] Call Trace: [262673.859116] [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph] [262673.968705] [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph] [262674.078852] [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph] [262674.134249] [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph] [262674.189124] [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph] [262674.243749] [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph] [262674.297485] [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph] [262674.350813] [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph] [262674.403312] [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph] [262674.454712] [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690 [262674.505096] [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph] [262674.555104] [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0 [262674.604072] [<ffffffff810901ea>] worker_thread+0x11a/0x470 [262674.652187] [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310 [262674.699022] [<ffffffff810957a2>] kthread+0xd2/0xf0 [262674.744494] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0 [262674.789543] [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70 [262674.834094] [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0 What happens is the following: (1) new MON session is established (2) old "none" ac is destroyed (3) new "cephx" ac is constructed ... (4) old OSD session (w/ "none" authorizer) is put ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer) osd->o_auth.authorizer in the "none" case is just a bare pointer into ac, which contains a single static copy for all services. By the time we get to (4), "none" ac, freed in (2), is long gone. On top of that, a new vtable installed in (3) points us at ceph_x_destroy_authorizer(), so we end up trying to destroy a "none" authorizer with a "cephx" destructor operating on invalid memory! To fix this, decouple authorizer destruction from ac and do away with a single static "none" authorizer by making a copy for each OSD or MDS session. Authorizers themselves are independent of ac and so there is no reason for destroy_authorizer() to be an ac op. Make it an op on the authorizer itself by turning ceph_authorizer into a real struct. Fixes: http://tracker.ceph.com/issues/15447Reported-by: NAlan Zhang <alan.zhang@linux.com> Signed-off-by: NIlya Dryomov <idryomov@gmail.com> Reviewed-by: NSage Weil <sage@redhat.com>
-
- 25 4月, 2016 3 次提交
-
-
由 Majd Dibbiny 提交于
This patch introduces kexec support for mlx5. When switching kernels, kexec() calls shutdown, which unloads the driver and cleans its resources. In addition, remove unregister netdev from shutdown flow. This will allow a clean shutdown, even if some netdev clients did not release their reference from this netdev. Releasing The HW resources only is enough as the kernel is shutting down Signed-off-by: NMajd Dibbiny <majd@mellanox.com> Signed-off-by: NTariq Toukan <tariqt@mellanox.com> Signed-off-by: NHaggai Abramovsky <hagaya@mellanox.com> Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Saeed Mahameed 提交于
Set and report vport MTU rather than physical MTU, Driver will set both vport and physical port mtu and will rely on the query of vport mtu. SRIOV VFs have to report their MTU to their vport manager (PF), and this will allow them to work with any MTU they need without failing the request. Also for some cases where the PF is not a port owner, PF can work with MTU less than the physical port mtu if set physical port mtu didn't take effect. Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Saeed Mahameed 提交于
For set/query MTU port firmware commands the MTU field is 16 bits, here I changed all the "int mtu" parameters of the functions wrapping those firmware commands to be u16. Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 4月, 2016 1 次提交
-
-
由 Peter Zijlstra 提交于
lock_chain::base is used to store an index into the chain_hlocks[] array, however that array contains more elements than can be indexed using the u16. Change the lock_chain structure to use a bitfield to encode the data it needs and add BUILD_BUG_ON() assertions to check the fields are wide enough. Also, for DEBUG_LOCKDEP, assert that we don't run out of elements of that array; as that would wreck the collision detectoring. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160330093659.GS3408@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 22 4月, 2016 1 次提交
-
-
由 Daniel Jurgens 提交于
Maintain the PCI status and provide wrappers for enabling and disabling the PCI device. Performing the actions more than once without doing its opposite results in warning logs. This occurred when EEH hotplugged the device causing a warning for disabling an already disabled device. Fixes: 2ba5fbd6 ('net/mlx4_core: Handle AER flow properly') Signed-off-by: NDaniel Jurgens <danielj@mellanox.com> Signed-off-by: NYishai Hadas <yishaih@mellanox.com> Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 4月, 2016 1 次提交
-
-
由 Wei Ni 提交于
The commit 17e8351a consistently use int for temperature, however it missed a few in trip temperature and thermal_core. In current codes, the trip->temperature used "unsigned long" and zone->temperature used"int", if the temperature is negative value, it will get wrong result when compare temperature with trip temperature. This patch can fix it. Signed-off-by: NWei Ni <wni@nvidia.com> Signed-off-by: NEduardo Valentin <edubezval@gmail.com>
-
- 19 4月, 2016 1 次提交
-
-
由 Linus Torvalds 提交于
This gets rid of the horrible notion of having that struct inode *ptmx_inode be the linchpin of the interface between the pty code and devpts. By de-emphasizing the ptmx inode, a lot of things actually get cleaner, and we will have a much saner way forward. In particular, this will allow us to associate with any particular devpts instance at open-time, and not be artificially tied to one particular ptmx inode. The patch itself is actually fairly straightforward, and apart from some locking and return path cleanups it's pretty mechanical: - the interfaces that devpts exposes all take "struct pts_fs_info *" instead of "struct inode *ptmx_inode" now. NOTE! The "struct pts_fs_info" thing is a completely opaque structure as far as the pty driver is concerned: it's still declared entirely internally to devpts. So the pty code can't actually access it in any way, just pass it as a "cookie" to the devpts code. - the "look up the pts fs info" is now a single clear operation, that also does the reference count increment on the pts superblock. So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get ref" operation (devpts_get_ref(inode)), along with a "put ref" op (devpts_put_ref()). - the pty master "tty->driver_data" field now contains the pts_fs_info, not the ptmx inode. - because we don't care about the ptmx inode any more as some kind of base index, the ref counting can now drop the inode games - it just gets the ref on the superblock. - the pts_fs_info now has a back-pointer to the super_block. That's so that we can easily look up the information we actually need. Although quite often, the pts fs info was actually all we wanted, and not having to look it up based on some magical inode makes things more straightforward. In particular, now that "devpts_get_ref(inode)" operation should really be the *only* place we need to look up what devpts instance we're associated with, and we do it exactly once, at ptmx_open() time. The other side of this is that one ptmx node could now be associated with multiple different devpts instances - you could have a single /dev/ptmx node, and then have multiple mount namespaces with their own instances of devpts mounted on /dev/pts/. And that's all perfectly sane in a model where we just look up the pts instance at open time. This will eventually allow us to get rid of our odd single-vs-multiple pts instance model, but this patch in itself changes no semantics, only an internal binding model. Cc: Eric Biederman <ebiederm@xmission.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Greg KH <greg@kroah.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 4月, 2016 1 次提交
-
-
由 Hariprasad Shenai 提交于
After 104daa71 ("PCI: Determine actual VPD size on first access"), the PCI core computes the valid VPD size by parsing the VPD starting at offset 0x0. We don't attempt to read past that valid size because that causes some devices to crash. However, some devices do have data past that valid size. For example, Chelsio adapters contain two VPD structures, and the driver needs both of them. Add pci_set_vpd_size(). If a driver knows it is safe to read past the end of the VPD data structure at offset 0, it can use pci_set_vpd_size() to allow access to as much data as it needs. [bhelgaas: changelog, split patches, rename to pci_set_vpd_size() and return int (not ssize_t)] Fixes: 104daa71 ("PCI: Determine actual VPD size on first access") Tested-by: NSteve Wise <swise@opengridcomputing.com> Signed-off-by: NCasey Leedom <leedom@chelsio.com> Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
-
- 15 4月, 2016 2 次提交
-
-
由 Craig Gallek 提交于
With the SO_REUSEPORT socket option, it is possible to create sockets in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address. This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on the AF_INET6 sockets. Prior to the commits referenced below, an incoming IPv4 packet would always be routed to a socket of type AF_INET when this mixed-mode was used. After those changes, the same packet would be routed to the most recently bound socket (if this happened to be an AF_INET6 socket, it would have an IPv4 mapped IPv6 address). The change in behavior occurred because the recent SO_REUSEPORT optimizations short-circuit the socket scoring logic as soon as they find a match. They did not take into account the scoring logic that favors AF_INET sockets over AF_INET6 sockets in the event of a tie. To fix this problem, this patch changes the insertion order of AF_INET and AF_INET6 addresses in the TCP and UDP socket lists when the sockets have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at the tail of the list. This will force AF_INET sockets to always be considered first. Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection") Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection") Reported-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NCraig Gallek <kraig@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Linus Torvalds 提交于
A lot of seqfile users seem to be using things like %pK that uses the credentials of the current process, but that is actually completely wrong for filesystem interfaces. The unix semantics for permission checking files is to check permissions at _open_ time, not at read or write time, and that is not just a small detail: passing off stdin/stdout/stderr to a suid application and making the actual IO happen in privileged context is a classic exploit technique. So if we want to be able to look at permissions at read time, we need to use the file open credentials, not the current ones. Normal file accesses can just use "f_cred" (or any of the helper functions that do that, like file_ns_capable()), but the seqfile interfaces do not have any such options. It turns out that seq_file _does_ save away the user_ns information of the file, though. Since user_ns is just part of the full credential information, replace that special case with saving off the cred pointer instead, and suddenly seq_file has all the permission information it needs. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 4月, 2016 2 次提交
-
-
由 Toshi Kani 提交于
After 'commit fc0c2028 ("x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem()")', probing a PMEM device hits the BUG() error below on X86_32 kernel. kernel BUG at include/linux/pmem.h:48! memcpy_from_pmem() calls arch_memcpy_from_pmem(), which is unimplemented since CONFIG_ARCH_HAS_PMEM_API is undefined on X86_32. Fix the BUG() error by adding default_memcpy_from_pmem(). Acked-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NToshi Kani <toshi.kani@hpe.com> Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
-
由 Hans de Goede 提交于
Add a new NO_REPORT_LUNS quirk and set it for Seagate drives with an usb-id of: 0bc2:331a, as these will fail to respond to a REPORT_LUNS command. Cc: stable@vger.kernel.org Reported-and-tested-by: NDavid Webb <djw@noc.ac.uk> Signed-off-by: NHans de Goede <hdegoede@redhat.com> Acked-by: NAlan Stern <stern@rowland.harvard.edu> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 13 4月, 2016 1 次提交
-
-
由 Jaegeuk Kim 提交于
This patch fixes the issue introduced by the ext4 crypto fix in a same manner. For F2FS, however, we flush the pending IOs and wait for a while to acquire free memory. Fixes: c9af28fd ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
-
- 09 4月, 2016 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 08 4月, 2016 1 次提交
-
-
由 Alexander Duyck 提交于
This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: c3483384 ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: NAlexander Duyck <aduyck@mirantis.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-