提交 · 1018016c706f7ff9f56fde3a649789c47085a293 · openanolis / cloud-kernel

08 5月, 2015 7 次提交

sched, timer: Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability · 1018016c

由 Jason Low 提交于 4月 28, 2015

While running a database workload, we found a scalability issue with itimers.

Much of the problem was caused by the thread_group_cputimer spinlock.
Each time we account for group system/user time, we need to obtain a
thread_group_cputimer's spinlock to update the timers. On larger systems
(such as a 16 socket machine), this caused more than 30% of total time
spent trying to obtain this kernel lock to update these group timer stats.

This patch converts the timers to 64-bit atomic variables and use
atomic add to update them without a lock. With this patch, the percent
of total time spent updating thread group cputimer timers was reduced
from 30% down to less than 1%.

Note: On 32-bit systems using the generic 64-bit atomics, this causes
sample_group_cputimer() to take locks 3 times instead of just 1 time.
However, we tested this patch on a 32-bit system ARM system using the
generic atomics and did not find the overhead to be much of an issue.
An explanation for why this isn't an issue is that 32-bit systems usually
have small numbers of CPUs, and cacheline contention from extra spinlocks
called periodically is not really apparent on smaller systems.
Signed-off-by: NJason Low <jason.low2@hp.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

1018016c

sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() · 316c1608

由 Jason Low 提交于 4月 28, 2015

ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.
Signed-off-by: NJason Low <jason.low2@hp.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: NWaiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

316c1608

signals, ptrace, sched: Fix a misaligned load inside ptrace_attach() · e7cc4173

由 Palmer Dabbelt 提交于 4月 30, 2015

The misaligned load exception arises when running ptrace_attach() on
the RISC-V (which hasn't been upstreamed yet).  The problem is that
wait_on_bit() takes a void* but then proceeds to call test_bit(),
which takes a long*.  This allows an int-aligned pointer to be passed
to test_bit(), which promptly fails.  This will manifest on any other
asm-generic port where unaligned loads trap, where sizeof(long) >
sizeof(int), and where task_struct.jobctl ends up not being
long-aligned.

This patch changes task_struct.jobctl to be a long, which ensures it
has the correct alignment.
Signed-off-by: NPalmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bobby.prani@gmail.com
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: richard@nod.at
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/1430453997-32459-2-git-send-email-palmer@dabbelt.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

e7cc4173

sched/wait: Change wait_on_bit*() to take an unsigned long *, not a void * · 7e605987

由 Palmer Dabbelt 提交于 4月 30, 2015

The implementations of wait_on_bit*() will only work with long-aligned
memory on systems that don't support misaligned loads and stores.

This patch changes the function prototypes to ensure that the compiler
will enforce alignment.

Running

  make defconfig
  make KFLAGS="-Werror"

seems to indicate that, as of c56fb6564dcd ("Fix a misaligned load
inside ptrace_attach()"), there are now no users of non-long-aligned
calls to wait_on_bit*().  I additionally tried a few "make randconfig"
attempts, none of which failed to compile for this reason.
Signed-off-by: NPalmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bobby.prani@gmail.com
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: richard@nod.at
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/1430453997-32459-3-git-send-email-palmer@dabbelt.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

7e605987

signals, sched: Change all uses of JOBCTL_* from 'int' to 'long' · b76808e6

由 Palmer Dabbelt 提交于 4月 30, 2015

c56fb6564dcd ("Fix a misaligned load inside ptrace_attach()") makes
jobctl an "unsigned long".  It makes sense to have the masks applied
to it match that type.  This is currently just a cosmetic change, but
it will prevent the mask from being unexpectedly truncated if we ever
end up with masks with more bits.

One instance of "signr" is an int, but I left this alone because the
mask ensures that it will never overflow.
Signed-off-by: NPalmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bobby.prani@gmail.com
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: richard@nod.at
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/1430453997-32459-4-git-send-email-palmer@dabbelt.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

b76808e6

sched: Move the loadavg code to a more obvious location · 3289bdb4

由 Peter Zijlstra 提交于 4月 14, 2015

I could not find the loadavg code.. turns out it was hidden in a file
called proc.c. It further got mingled up with the cruft per rq load
indexes (which we really want to get rid of).

Move the per rq load indexes into the fair.c load-balance code (that's
the only thing that uses them) and rename proc.c to loadavg.c so we
can find it again.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Did minor cleanups to the code. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

3289bdb4

sched: Handle priority boosted tasks proper in setscheduler() · 0782e63b

由 Thomas Gleixner 提交于 5月 05, 2015

Ronny reported that the following scenario is not handled correctly:

	T1 (prio = 10)
	   lock(rtmutex);

	T2 (prio = 20)
	   lock(rtmutex)
	      boost T1

	T1 (prio = 20)
	   sys_set_scheduler(prio = 30)
	   T1 prio = 30
	   ....
	   sys_set_scheduler(prio = 10)
	   T1 prio = 30

The last step is wrong as T1 should now be back at prio 20.

Commit c365c292 ("sched: Consider pi boosting in setscheduler()")
only handles the case where a boosted tasks tries to lower its
priority.

Fix it by taking the new effective priority into account for the
decision whether a change of the priority is required.
Reported-by: NRonny Meeus <ronny.meeus@gmail.com>
Tested-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: c365c292 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanosSigned-off-by: NIngo Molnar <mingo@kernel.org>

0782e63b

06 5月, 2015 4 次提交

nilfs2: fix sanity check of btree level in nilfs_btree_root_broken() · d8fd150f

由 Ryusuke Konishi 提交于 5月 05, 2015

The range check for b-tree level parameter in nilfs_btree_root_broken()
is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even
though the level is limited to values in the range of 0 to
(NILFS_BTREE_LEVEL_MAX - 1).

Since the level parameter is read from storage device and used to index
nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it
can cause memory overrun during btree operations if the boundary value
is set to the level parameter on device.

This fixes the broken sanity check and adds a comment to clarify that
the upper bound NILFS_BTREE_LEVEL_MAX is exclusive.
Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d8fd150f

util_macros.h: have array pointer point to array of constants · 05836c37

由 Guenter Roeck 提交于 5月 05, 2015

Using the new find_closest() macro can result in the following sparse
warnings.

  drivers/hwmon/lm85.c:194:16: warning:
  		incorrect type in initializer (different modifiers)
  drivers/hwmon/lm85.c:194:16:    expected int *__fc_a
  drivers/hwmon/lm85.c:194:16:    got int static const [toplevel] *<noident>
  drivers/hwmon/lm85.c:210:16: warning:
  		incorrect type in initializer (different modifiers)
  drivers/hwmon/lm85.c:210:16:    expected int *__fc_a
  drivers/hwmon/lm85.c:210:16:    got int const *map

This is because the array passed to find_closest() will typically be
declared as array of constants, but the macro declares a non-constant
pointer to it.
Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
Cc: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

05836c37

IB/core: Fix unaligned accesses · 0d0f738f

由 David Ahern 提交于 5月 03, 2015

Addresses the following kernel logs seen during boot of sparc systems:

Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Signed-off-by: NDavid Ahern <david.ahern@oracle.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

0d0f738f

IB/core: change rdma_gid2ip into void function as it always return zero · 471e7058

由 Honggang LI 提交于 4月 29, 2015

Signed-off-by: NHonggang Li <honli@redhat.com>
Acked-by: NSean Hefty <sean.hefty@intel.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

471e7058

05 5月, 2015 1 次提交

RDMA/core: Enable the iWarp Port Mapper to provide the actual address of the... · 6eec1774

由 Tatyana Nikolova 提交于 4月 21, 2015

RDMA/core: Enable the iWarp Port Mapper to provide the actual address of the connecting peer to its clients

Add functionality to enable the port mapper on the passive side to provide to its
clients the actual (non-mapped) ip/tcp address information of the connecting peer

1) Adding remote_info_cb() to process the address info of the connecting peer
The address info is provided by the user space port mapper service when
the connection is initiated by the peer
2) Adding a hash list to store the remote address info
3) Adding functionality to add/remove the remote address info
After the info has been provided to the port mapper client,
it is removed from the hash list
Signed-off-by: NTatyana Nikolova <tatyana.e.nikolova@intel.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

6eec1774

04 5月, 2015 1 次提交

lib: make memzero_explicit more robust against dead store elimination · 7829fb09

由 Daniel Borkmann 提交于 4月 30, 2015

In commit 0b053c95 ("lib: memzero_explicit: use barrier instead
of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
case LTO would decide to inline memzero_explicit() and eventually
find out it could be elimiated as dead store.

While using barrier() works well for the case of gcc, recent efforts
from LLVMLinux people suggest to use llvm as an alternative to gcc,
and there, Stephan found in a simple stand-alone user space example
that llvm could nevertheless optimize and thus elimitate the memset().
A similar issue has been observed in the referenced llvm bug report,
which is regarded as not-a-bug.

Based on some experiments, icc is a bit special on its own, while it
doesn't seem to eliminate the memset(), it could do so with an own
implementation, and then result in similar findings as with llvm.

The fix in this patch now works for all three compilers (also tested
with more aggressive optimization levels). Arguably, in the current
kernel tree it's more of a theoretical issue, but imho, it's better
to be pedantic about it.

It's clearly visible with gcc/llvm though, with the below code: if we
would have used barrier() only here, llvm would have omitted clearing,
not so with barrier_data() variant:

  static inline void memzero_explicit(void *s, size_t count)
  {
    memset(s, 0, count);
    barrier_data(s);
  }

  int main(void)
  {
    char buff[20];
    memzero_explicit(buff, sizeof(buff));
    return 0;
  }

  $ gcc -O2 test.c
  $ gdb a.out
  (gdb) disassemble main
  Dump of assembler code for function main:
   0x0000000000400400  <+0>: lea   -0x28(%rsp),%rax
   0x0000000000400405  <+5>: movq  $0x0,-0x28(%rsp)
   0x000000000040040e <+14>: movq  $0x0,-0x20(%rsp)
   0x0000000000400417 <+23>: movl  $0x0,-0x18(%rsp)
   0x000000000040041f <+31>: xor   %eax,%eax
   0x0000000000400421 <+33>: retq
  End of assembler dump.

  $ clang -O2 test.c
  $ gdb a.out
  (gdb) disassemble main
  Dump of assembler code for function main:
   0x00000000004004f0  <+0>: xorps  %xmm0,%xmm0
   0x00000000004004f3  <+3>: movaps %xmm0,-0x18(%rsp)
   0x00000000004004f8  <+8>: movl   $0x0,-0x8(%rsp)
   0x0000000000400500 <+16>: lea    -0x18(%rsp),%rax
   0x0000000000400505 <+21>: xor    %eax,%eax
   0x0000000000400507 <+23>: retq
  End of assembler dump.

As gcc, clang, but also icc defines __GNUC__, it's sufficient to define
this in compiler-gcc.h only to be picked up. For a fallback or otherwise
unsupported compiler, we define it as a barrier. Similarly, for ecc which
does not support gcc inline asm.

Reference: https://llvm.org/bugs/show_bug.cgi?id=15495Reported-by: NStephan Mueller <smueller@chronox.de>
Tested-by: NStephan Mueller <smueller@chronox.de>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Stephan Mueller <smueller@chronox.de>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: mancha security <mancha1@zoho.com>
Cc: Mark Charlebois <charlebm@gmail.com>
Cc: Behan Webster <behanw@converseincode.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

7829fb09

02 5月, 2015 1 次提交

virtio: fix typo in vring_need_event() doc comment · e412d3a3

由 Stefan Hajnoczi 提交于 5月 02, 2015

Here the "other side" refers to the guest or host.
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e412d3a3

30 4月, 2015 2 次提交

bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da

由 Nicolas Dichtel 提交于 4月 28, 2015

NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.

Libraries like libnl will wait forever for NLMSG_DONE.

Fixes: e5a55a89 ("net: create generic bridge ops")
Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Sathya Perla <sathya.perla@emulex.com>
CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
CC: Ajit Khaparde <ajit.khaparde@emulex.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: intel-wired-lan@lists.osuosl.org
CC: Jiri Pirko <jiri@resnulli.us>
CC: Scott Feldman <sfeldma@gmail.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46c264da

xen: Suspend ticks on all CPUs during suspend · 2b953a5e

由 Boris Ostrovsky 提交于 4月 28, 2015

Commit 77e32c89 ("clockevents: Manage device's state separately for
the core") decouples clockevent device's modes from states. With this
change when a Xen guest tries to resume, it won't be calling its
set_mode op which needs to be done on each VCPU in order to make the
hypervisor aware that we are in oneshot mode.

This happens because clockevents_tick_resume() (which is an intermediate
step of resuming ticks on a processor) doesn't call clockevents_set_state()
anymore and because during suspend clockevent devices on all VCPUs (except
for the one doing the suspend) are left in ONESHOT state. As result, during
resume the clockevents state machine will assume that device is already
where it should be and doesn't need to be updated.

To avoid this problem we should suspend ticks on all VCPUs during
suspend.
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>

2b953a5e

29 4月, 2015 2 次提交

ALSA: emu10k1: Emu10k2 32 bit DMA mode · 7241ea55

由 Peter Zubaj 提交于 4月 28, 2015

Looks like audigy emu10k2 (probably emu10k1 - sb live too) support two
modes for DMA. Second mode is useful for 64 bit os with more then 2 GB
of ram (fixes problems with big soundfont loading)

1) 32MB from 2 GB address space using 8192 pages (used now as default)
2) 16MB from 4 GB address space using 4096 pages

Mode is set using HCFG_EXPANDED_MEM flag in HCFG register.
Also format of emu10k2 page table is then different.
Signed-off-by: NPeter Zubaj <pzubaj@marticonet.sk>
Tested-by: NTakashi Iwai <tiwai@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: NTakashi Iwai <tiwai@suse.de>

7241ea55

ACPICA: remove duplicate u8 typedef · 9e9d55e6

由 Olaf Hering 提交于 4月 28, 2015

During commit e252652f ("ACPICA: acpidump: Remove integer types
translation protection.") two 'unsigned char' types got converted to 'u8'.

The result does not compile with gcc-4.5, it can not cope with duplicate
typedefs.
Signed-off-by: NOlaf Hering <olaf@aepfle.de>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

9e9d55e6

28 4月, 2015 4 次提交

ASoC: Update email-id of Rajeev Kumar · 9d7dd6cd

由 Rajeev Kumar 提交于 4月 28, 2015

rajeev-dlh.kumar@st.com email-id doesn't exist anymore as I have left the
company. Replace ST's id with Rajeev Kumar <rajeevkumar.linux@gmail.com>
Signed-off-by: NRajeev Kumar <rajeevkumar.linux@gmail.com>
Signed-off-by: NMark Brown <broonie@kernel.org>

9d7dd6cd

tty: Re-add external interface for tty_set_termios() · b00f5c2d

由 Frederic Danis 提交于 4月 10, 2015

This is needed by Bluetooth hci_uart module to be able to change speed
of Bluetooth controller and local UART.
Signed-off-by: NFrederic Danis <frederic.danis@linux.intel.com>
Reviewed-by: NPeter Hurley <peter@hurleysoftware.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

b00f5c2d

uas: Add US_FL_MAX_SECTORS_240 flag · ee136af4

由 Hans de Goede 提交于 4月 21, 2015

The usb-storage driver sets max_sectors = 240 in its scsi-host template,
for uas we do not want to do that for all devices, but testing has shown
that some devices need it.

This commit adds a US_FL_MAX_SECTORS_240 flag for such devices, and
implements support for it in uas.c, while at it it also adds support
for US_FL_MAX_SECTORS_64 to uas.c.

Cc: stable@vger.kernel.org # 3.16
Signed-off-by: NHans de Goede <hdegoede@redhat.com>
Acked-by: NAlan Stern <stern@rowland.harvard.edu>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

ee136af4

SCSI: add 1024 max sectors black list flag · 35e9a9f9

由 Mike Christie 提交于 4月 20, 2015

This works around a issue with qnap iscsi targets not handling large IOs
very well.

The target returns:

VPD INQUIRY: Block limits page (SBC)
  Maximum compare and write length: 1 blocks
  Optimal transfer length granularity: 1 blocks
  Maximum transfer length: 4294967295 blocks
  Optimal transfer length: 4294967295 blocks
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
  Maximum unmap LBA count: 8388607
  Maximum unmap block descriptor count: 1
  Optimal unmap granularity: 16383
  Unmap granularity alignment valid: 0
  Unmap granularity alignment: 0
  Maximum write same length: 0xffffffff blocks
  Maximum atomic transfer length: 0
  Atomic alignment: 0
  Atomic transfer length granularity: 0

and it is *sometimes* able to handle at least one IO of size up to 8 MB. We
have seen in traces where it will sometimes work, but other times it
looks like it fails and it looks like it returns failures if we send
multiple large IOs sometimes. Also it looks like it can return 2 different
errors. It will sometimes send iscsi reject errors indicating out of
resources or it will send invalid cdb illegal requests check conditions.
And then when it sends iscsi rejects it does not seem to handle retries
when there are command sequence holes, so I could not just add code to
try and gracefully handle that error code.

The problem is that we do not have a good contact for the company,
so we are not able to determine under what conditions it returns
which error and why it sometimes works.

So, this patch just adds a new black list flag to set targets like this to
the old max safe sectors of 1024. The max_hw_sectors changes added in 3.19
caused this regression, so I also ccing stable.
Reported-by: NChristian Hesse <list@eworm.de>
Signed-off-by: NMike Christie <michaelc@cs.wisc.edu>
Cc: stable@vger.kernel.org
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJames Bottomley <JBottomley@Odin.com>

35e9a9f9

27 4月, 2015 3 次提交

x86: pvclock: Really remove the sched notifier for cross-cpu migrations · 73459e2a

由 Paolo Bonzini 提交于 4月 23, 2015

This reverts commits 0a4e6be9
and 80f7fdb1.

The task migration notifier was originally introduced in order to support
the pvclock vsyscall with non-synchronized TSC, but KVM only supports it
with synchronized TSC.  Hence, on KVM the race condition is only needed
due to a bad implementation on the host side, and even then it's so rare
that it's mostly theoretical.

As far as KVM is concerned it's possible to fix the host, avoiding the
additional complexity in the vDSO and the (re)introduction of the task
migration notifier.

Xen, on the other hand, hasn't yet implemented vsyscall support at
all, so we do not care about its plans for non-synchronized TSC.
Reported-by: NPeter Zijlstra <peterz@infradead.org>
Suggested-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

73459e2a

xen/grant: introduce func gnttab_unmap_refs_sync() · b44166cd

由 Bob Liu 提交于 4月 03, 2015

There are several place using gnttab async unmap and wait for
completion, so move the common code to a function
gnttab_unmap_refs_sync().
Signed-off-by: NBob Liu <bob.liu@oracle.com>
Acked-by: NRoger Pau Monné <roger.pau@citrix.com>
Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>

b44166cd

net/bonding: Make DRV macros private · 73b5a6f2

由 Matan Barak 提交于 4月 26, 2015

The bonding modules currently defines four macros with
general names that pollute the global namespace:
DRV_VERSION
DRV_RELDATE
DRV_NAME
DRV_DESCRIPTION

Fixing that by defining a private bonding_priv.h
header files which includes those defines.
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

73b5a6f2

26 4月, 2015 1 次提交

net: fix crash in build_skb() · 2ea2f62c

由 Eric Dumazet 提交于 4月 24, 2015

When I added pfmemalloc support in build_skb(), I forgot netlink
was using build_skb() with a vmalloc() area.

In this patch I introduce __build_skb() for netlink use,
and build_skb() is a wrapper handling both skb->head_frag and
skb->pfmemalloc

This means netlink no longer has to hack skb->head_frag

[ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 1567.700067] Dumping ftrace buffer:
[ 1567.700067]    (ftrace buffer empty)
[ 1567.700067] Modules linked in:
[ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
[ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
[ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
[ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
[ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
[ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
[ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
[ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
[ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
[ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
[ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
[ 1567.700067] Stack:
[ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
[ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
[ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
[ 1567.700067] Call Trace:
[ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
[ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
[ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
[ 1567.774369] sock_write_iter (net/socket.c:823)
[ 1567.774369] ? sock_sendmsg (net/socket.c:806)
[ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
[ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 1567.774369] ? default_llseek (fs/read_write.c:487)
[ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
[ 1567.774369] vfs_write (fs/read_write.c:539)
[ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
[ 1567.774369] ? SyS_read (fs/read_write.c:577)
[ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2ea2f62c

25 4月, 2015 3 次提交

fix I_DIO_WAKEUP definition · ac74d8d6

由 Eric Sandeen 提交于 4月 16, 2015

I_DIO_WAKEUP is never directly used, but fix it up anyway.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ac74d8d6

direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0

由 Jens Axboe 提交于 4月 15, 2015

do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJens Axboe <axboe@fb.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

fe0f07d0

netfilter: bridge: fix NULL deref in physin/out ifindex helpers · 547c4b54

由 Florian Westphal 提交于 4月 20, 2015

Might not have an outdev yet. We'll oops when iface goes down while skbs
are still nfqueue'd:

RIP: 0010:[<ffffffff81422a2f>]  [<ffffffff81422a2f>] dev_cmp+0x4f/0x80
nfqnl_rcv_dev_event+0xe2/0x150
notifier_call_chain+0x53/0xa0

Fixes: c737b7c4 ("netfilter: bridge: add helpers for fetching physin/outdev")
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

547c4b54

24 4月, 2015 6 次提交

inet: fix possible panic in reqsk_queue_unlink() · b357a364

由 Eric Dumazet 提交于 4月 23, 2015

[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
 0000000000000080
[ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
to properly handle this.

(Same remark for dccp_check_req() and its callers)

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.

Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b357a364

rhashtable: don't attempt to grow when at max_size · 1d8dc3d3

由 Johannes Berg 提交于 4月 23, 2015

The conversion of mac80211's station table to rhashtable had a bug
that I found by accident in code review, that hadn't been found as
rhashtable apparently managed to have a maximum hash chain length
of one (!) in all our testing.

In order to test the bug and verify the fix I set my rhashtable's
max_size very low (4) in order to force getting hash collisions.

At that point, rhashtable WARNed in rhashtable_insert_rehash() but
didn't actually reject the hash table insertion. This caused it to
lose insertions - my master list of stations would have 9 entries,
but the rhashtable only had 5. This may warrant a deeper look, but
that WARN_ON() just shouldn't happen.

Fix this by not returning true from rht_grow_above_100() when the
rhashtable's max_size has been reached - in this case the user is
explicitly configuring it to be at most that big, so even if it's
now above 100% it shouldn't attempt to resize.

This fixes the "lost insertion" issue and consequently allows my
code to display its error (and verify my fix for it.)
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1d8dc3d3

NFS: Move nfs_idmap.h into fs/nfs/ · 40c64c26

由 Anna Schumaker 提交于 4月 15, 2015

This file is only used internally to the NFS v4 module, so it doesn't
need to be in the global include path.  I also renamed it from
nfs_idmap.h to nfs4idmap.h to emphasize that it's an NFSv4-only include
file.
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

40c64c26

NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h · f9ebd618

由 Anna Schumaker 提交于 4月 15, 2015

The idmapper is completely internal to the NFS v4 module, so this macro
will always evaluate to true. This patch also removes unnecessary
includes of this file from the generic NFS client.
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

f9ebd618

sunrpc: make debugfs file creation failure non-fatal · 3f940098

由 Jeff Layton 提交于 3月 31, 2015

v2: gracefully handle the case where some dentry pointers end up NULL
    and be more dilligent about zeroing out dentry pointers

We currently have a problem that SELinux policy is being enforced when
creating debugfs files. If a debugfs file is created as a side effect of
doing some syscall, then that creation can fail if the SELinux policy
for that process prevents it.

This seems wrong. We don't do that for files under /proc, for instance,
so Bruce has proposed a patch to fix that.

While discussing that patch however, Greg K.H. stated:

    "No kernel code should care / fail if a debugfs function fails, so
     please fix up the sunrpc code first."

This patch converts all of the sunrpc debugfs setup code to be void
return functins, and the callers to not look for errors from those
functions.

This should allow rpc_clnt and rpc_xprt creation to work, even if the
kernel fails to create debugfs files for some reason.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: N"J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

3f940098

NFS: Don't zap caches on fallocate() · 9a51940b

由 Anna Schumaker 提交于 3月 16, 2015

This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE
operations so we can set the updated inode size and change attribute
directly.  DEALLOCATE will still need to release pagecache pages, so
nfs42_proc_deallocate() now calls truncate_pagecache_range() before
contacting the server.
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

9a51940b

23 4月, 2015 5 次提交

netdev_alloc_pcpu_stats: use less common iterator variable · ec65aafb

由 Johannes Berg 提交于 4月 23, 2015

With the CPU iteration variable called 'i', it's relatively easy
to have variable shadowing which sparse will warn about. Avoid
that by renaming the variable to __cpu which is less likely to
be used in the surrounding context.
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ec65aafb

kexec: allocate the kexec control page with KEXEC_CONTROL_MEMORY_GFP · 7e01b5ac

由 Martin Schwidefsky 提交于 4月 16, 2015

Introduce KEXEC_CONTROL_MEMORY_GFP to allow the architecture code
to override the gfp flags of the allocation for the kexec control
page. The loop in kimage_alloc_normal_control_pages allocates pages
with GFP_KERNEL until a page is found that happens to have an
address smaller than the KEXEC_CONTROL_MEMORY_LIMIT. On systems
with a large memory size but a small KEXEC_CONTROL_MEMORY_LIMIT
the loop will keep allocating memory until the oom killer steps in.
Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>

7e01b5ac

ASoC: dapm: Enable autodisable on SOC_DAPM_SINGLE_TLV_AUTODISABLE · a2d97723

由 Charles Keepax 提交于 4月 22, 2015

Correct small copy and paste error where autodisable was not being
enabled for the SOC_DAPM_SINGLE_TLV_AUTODISABLE control.
Signed-off-by: NCharles Keepax <ckeepax@opensource.wolfsonmicro.com>
Signed-off-by: NMark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org

a2d97723

mpls: Per-device MPLS state · 03c57747

由 Robert Shearman 提交于 4月 22, 2015

Add per-device MPLS state to supported interfaces. Use the presence of
this state in mpls_route_add to determine that this is a supported
interface.

Use the presence of mpls_dev to drop packets that arrived on an
unsupported interface - previously they were allowed through.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NRobert Shearman <rshearma@brocade.com>
Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

03c57747

Revert "mm: avoid tail page refcounting on non-THP compound pages" · f3ca10dd

由 Linus Torvalds 提交于 4月 22, 2015

This reverts commit 8d63d99a.

It causes in VM mapping refcount errors:

  page:ffffea0010a15040 count:0 mapcount:1 mapping:          (null) index:0x0
  flags: 0x8000000000008014(referenced|dirty|tail)
  page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) != 0)
  ------------[ cut here ]------------
  kernel BUG at mm/swap.c:134!

as reported by Borislav Petkov
Reported-and-tested-by: NBorislav Petkov <bp@alien8.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f3ca10dd

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功