Commits · 08eee69fcf6baea543a2b4d2a2fcba0e61aa3160 · openeuler / Kernel

13 Feb, 2015 6 commits

zram: remove init_lock in zram_make_request · 08eee69f

Authored by Minchan Kim on Feb 12, 2015

Admin could reset zram during I/O operation going on so we have used
zram->init_lock as read-side lock in I/O path to prevent sudden zram
meta freeing.

However, the init_lock is really troublesome.  We can't call
zram_meta_alloc under init_lock due to a lockdep splat, because
zram_rw_page is one of the functions in the reclaim path and holds it as
read_lock while other places in process context hold it as write_lock.
So, we have done the allocation outside the lock to avoid the lockdep
warning, but it's not good for readability and finally, I met another
lockdep splat between init_lock and cpu_hotplug from kmem_cache_destroy
while working on zsmalloc compaction.  :(

Yes, the ideal is to remove the horrible init_lock of zram in the rw path.
This patch removes it in the rw path and instead adds an atomic refcount
for meta lifetime management and a completion to free meta in process
context.  It's important to free meta in process context because some of
the resource destruction needs a mutex lock, which could already be held
if we release the resource in reclaim context, so it would deadlock again.

As a bonus, we could remove the init_done check in the rw path because
zram_meta_get will play that role instead.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

08eee69f
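
The refcount-plus-completion idea described in 08eee69f can be sketched in userspace C11 as follows. This is a minimal model, not the driver code: struct zdev, zdev_init() and zdev_reset() are hypothetical names, and a boolean flag stands in for the kernel completion that the reset path would sleep on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the zram device state. */
struct zdev {
	atomic_int refcount;	/* 0: meta is gone or being torn down */
	atomic_bool io_done;	/* stands in for the kernel completion */
};

/* Init takes the base reference that the reset path drops later. */
static void zdev_init(struct zdev *z)
{
	atomic_init(&z->refcount, 1);
	atomic_init(&z->io_done, false);
}

/* I/O path: take a reference only while the count is non-zero. */
static bool meta_get(struct zdev *z)
{
	int old = atomic_load(&z->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&z->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the final put signals the waiter. */
static void meta_put(struct zdev *z)
{
	int old = atomic_fetch_sub(&z->refcount, 1);

	assert(old > 0);
	if (old == 1)
		atomic_store(&z->io_done, true);
}

/* Reset path: drop the base reference, report whether meta can be freed. */
static bool zdev_reset(struct zdev *z)
{
	meta_put(z);
	return atomic_load(&z->io_done);
}
```

In this model an I/O path brackets its work with meta_get()/meta_put(), and a failed meta_get() plays the role of the old init_done check. The reset path would wait (here: poll the flag) until the last in-flight I/O has dropped its reference, and only then free meta from process context.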

zram: check bd_openers instead of bd_holders · 2b269ce6

Authored by Minchan Kim on Feb 12, 2015

bd_holders is increased only when a user opens the device file with
FMODE_EXCL, so if something opens zram0 as !FMODE_EXCL and requests I/O
while another user resets zram0, we can see the following warning.

  zram0: detected capacity change from 0 to 64424509440
  Buffer I/O error on dev zram0, logical block 180823, lost async page write
  Buffer I/O error on dev zram0, logical block 180824, lost async page write
  Buffer I/O error on dev zram0, logical block 180825, lost async page write
  Buffer I/O error on dev zram0, logical block 180826, lost async page write
  Buffer I/O error on dev zram0, logical block 180827, lost async page write
  Buffer I/O error on dev zram0, logical block 180828, lost async page write
  Buffer I/O error on dev zram0, logical block 180829, lost async page write
  Buffer I/O error on dev zram0, logical block 180830, lost async page write
  Buffer I/O error on dev zram0, logical block 180831, lost async page write
  Buffer I/O error on dev zram0, logical block 180832, lost async page write
  ------------[ cut here ]------------
  WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
  Modules linked in:
  CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x45/0x57
    warn_slowpath_common+0x8a/0xc0
    warn_slowpath_null+0x1a/0x20
    __blkdev_put+0x1d7/0x210
    blkdev_put+0x50/0x130
    blkdev_close+0x25/0x30
    __fput+0xdf/0x1e0
    ____fput+0xe/0x10
    task_work_run+0xa7/0xe0
    do_notify_resume+0x49/0x60
    int_signal+0x12/0x17
  ---[ end trace 274fbbc5664827d2 ]---

The warning comes from bdev_write_inode in the blkdev_put path.

   static void bdev_write_inode(struct inode *inode)
   {
        spin_lock(&inode->i_lock);
        while (inode->i_state & I_DIRTY) {
                spin_unlock(&inode->i_lock);
                WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
                spin_lock(&inode->i_lock);
        }
        spin_unlock(&inode->i_lock);
   }

The reason is that the dd process encounters I/O failures because the
block device suddenly disappears, so filemap_check_errors in
__writeback_single_inode returns -EIO.

If we check bd_openers instead of bd_holders, we could address the
problem.  When I looked at brd, it already uses bd_openers rather than
bd_holders, so although I'm not an expert on the block layer, it seems to
be better.

I can trigger the warning above with the simple script below.  In
addition, I added msleep(2000) below set_capacity(zram->disk, 0) after
applying your patch to make the window huge (kudos to Ganesh!).

script:

   echo $((60<<30)) > /sys/block/zram0/disksize
   setsid dd if=/dev/zero of=/dev/zram0 &
   sleep 1
   setsid echo 1 > /sys/block/zram0/reset

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2b269ce6

zram: rework reset and destroy path · a096cafc

Authored by Sergey Senozhatsky on Feb 12, 2015

We need to move set_capacity(disk, 0) from reset_store() back to
zram_reset_device(), a catch by Ganesh Mahendran.  Potentially, we can
race set_capacity() calls from init and reset paths.

The problem is that zram_reset_device() is also getting called from
zram_exit(), which performs operations in a misleading reversed order -- we
first create_device() and then init it, while zram_exit() performs
destroy_device() first and then does zram_reset_device().  This is done to
remove the sysfs group before we reset the device, so we can continue with
device reset/destruction without being raced by a sysfs attr write (e.g.
disksize).

Apart from that, destroy_device() releases zram->disk (but we still have
the ->disk pointer), so we cannot access zram->disk in the later
zram_reset_device() call, which may cause additional errors in the future.

So, this patch reworks and cleans up the destroy path.

1) remove several unneeded goto labels in zram_init()

2) factor out zram_init() error path and zram_exit() into
   destroy_devices() function, which takes the number of devices to
   destroy as its argument.

3) remove sysfs group in destroy_devices() first, so we can reorder
   operations -- reset device (as expected) goes before disk destroy and
   queue cleanup.  So we can always access ->disk in zram_reset_device().

4) and, finally, return set_capacity() back under ->init_lock.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a096cafc
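
Point 2 of a096cafc (one destroy_devices() helper shared by the init error path and the exit path, taking the number of devices to destroy) can be illustrated with a small userspace model. Everything here apart from the destroy_devices() name is hypothetical: the struct, the injected failure, and the fixed-size array are assumptions made only for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVS 8

/* Hypothetical device slot; "live" stands in for allocated resources. */
struct dev {
	bool live;
};

static struct dev devs[MAX_DEVS];

/* Create one device; fails when id == fail_at (simulated error). */
static int create_device(struct dev *d, size_t id, size_t fail_at)
{
	if (id == fail_at)
		return -1;
	d->live = true;
	return 0;
}

/* Tear down the first nr devices; used by both error and exit paths. */
static void destroy_devices(size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		devs[i].live = false;
}

/* Create n devices; pass fail_at >= n to inject no failure. */
static int init_devices(size_t n, size_t fail_at)
{
	assert(n <= MAX_DEVS);
	for (size_t i = 0; i < n; i++) {
		if (create_device(&devs[i], i, fail_at) != 0) {
			destroy_devices(i);	/* only the i created so far */
			return -1;
		}
	}
	return 0;
}

/* Count devices that still hold resources. */
static size_t live_count(void)
{
	size_t n = 0;

	for (size_t i = 0; i < MAX_DEVS; i++)
		n += devs[i].live ? 1 : 0;
	return n;
}
```

Passing the count lets the error path undo exactly the devices that were created, which is what removes the need for several goto labels in the init function.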

zram: fix umount-reset_store-mount race condition · ba6b17d6

Authored by Sergey Senozhatsky on Feb 12, 2015

Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to
avoid ->bd_holders race condition:

        CPU0                            CPU1
umount /* zram->init_done is true */
reset_store()
bdev->bd_holders == 0                   mount
...                                     zram_make_request()
zram_reset_device()

However, his solution required some considerable amount of code movement,
which we can avoid.

Apart from using bdev->bd_mutex in reset_store(), this patch also
simplifies zram_reset_device().

zram_reset_device() has a bool parameter reset_capacity which tells it
whether the disk capacity and the disk itself should be reset.  There are
two zram_reset_device() callers:

-- zram_exit() passes reset_capacity=false
-- reset_store() passes reset_capacity=true

So we can move reset_capacity-sensitive work out of zram_reset_device()
and perform it unconditionally in reset_store().  This also lets us drop
the reset_capacity parameter from zram_reset_device() and pass only the
zram pointer.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ba6b17d6
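
The race in ba6b17d6 closes once the "is anyone using the device?" check and the reset happen under the same lock that open takes. The sketch below is a userspace model under stated assumptions: struct bdev here is hypothetical, a pthread mutex stands in for bdev->bd_mutex, and the counter is named after bd_openers from 2b269ce6 even though the original patch checked bd_holders.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical model of the block device state touched by reset. */
struct bdev {
	pthread_mutex_t mutex;		/* stands in for bdev->bd_mutex */
	int openers;			/* stands in for bdev->bd_openers */
	unsigned long long capacity;
};

static void bdev_init(struct bdev *b, unsigned long long capacity)
{
	pthread_mutex_init(&b->mutex, NULL);
	b->openers = 0;
	b->capacity = capacity;
}

/* Open and close change the opener count under the mutex. */
static void bdev_open(struct bdev *b)
{
	pthread_mutex_lock(&b->mutex);
	b->openers++;
	pthread_mutex_unlock(&b->mutex);
}

static void bdev_close(struct bdev *b)
{
	pthread_mutex_lock(&b->mutex);
	assert(b->openers > 0);
	b->openers--;
	pthread_mutex_unlock(&b->mutex);
}

/* Check and reset are one critical section, so no open can slip between. */
static int reset_store(struct bdev *b)
{
	int ret = 0;

	pthread_mutex_lock(&b->mutex);
	if (b->openers > 0)
		ret = -EBUSY;
	else
		b->capacity = 0;
	pthread_mutex_unlock(&b->mutex);
	return ret;
}
```

Without the mutex, a mount on another CPU could open the device after the counter was read as zero but before the reset ran, which is the interleaving shown in the CPU0/CPU1 diagram.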

zram: free meta table in zram_meta_free · 1fec1172

Authored by Ganesh Mahendran on Feb 12, 2015

zram_meta_alloc() and zram_meta_free() are a pair.  In
zram_meta_alloc(), the meta table is allocated.  So it is better to free
it in zram_meta_free().

Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1fec1172

zram: clean up zram_meta_alloc() · b8179958

Authored by Sergey Senozhatsky on Feb 12, 2015

A trivial cleanup of zram_meta_alloc() error handling.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b8179958

12 Feb, 2015 5 commits

mm: gup: use get_user_pages_unlocked · 7e339128

Authored by Andrea Arcangeli on Feb 11, 2015

This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
the page fault in order to release the mmap_sem during the I/O.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7e339128

oom, PM: make OOM detection in the freezer path raceless · c32b3cbe

Authored by Michal Hocko on Feb 11, 2015

Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window in which the OOM killer manages to
note_oom_kill after freeze_processes checks the counter.  The race
window is quite small and really unlikely, and a partial solution was
deemed sufficient at the time of submission.

Tejun wasn't happy about this partial solution though and insisted on a
full solution.  That requires the full OOM and freezer's task freezing
exclusion, though.  This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.

oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.

oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations.  Then it disables all further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim and unmark_oom_victim
respectively.  The last victim wakes up all waiters enqueued by
oom_killer_disable().  Therefore this function acts as the full OOM
barrier.

The page fault path is covered now as well although it was assumed to be
safe before.  As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here.  Same applies
to the memcg OOM killer.

out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation.  The page allocation
path simply fails the allocation same as before.  The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.

Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because

	- the victim cannot loop in the page fault handler (it would die
	  on the way out from the exception)
	- it cannot loop in the page allocator because all the further
	  allocation would fail and __GFP_NOFAIL allocations are not
	  acceptable at this stage
	- it shouldn't be blocked on any locks held by frozen tasks
	  (try_to_freeze expects lockless context) and kernel threads and
	  work queues are not frozen yet

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c32b3cbe
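
The victim counting and the disable flag from c32b3cbe can be modeled in a few lines of C11. This is a deliberately simplified, non-blocking sketch: it leaves out oom_sem entirely, oom_killer_disable_poll() is a hypothetical polling variant of oom_killer_disable() (the real function sleeps until the last victim is gone), and out_of_memory() here only marks a victim instead of selecting and killing a task.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int oom_victims;		/* tasks killed but not yet exited */
static atomic_bool oom_killer_disabled;	/* set by the freezer path */

/* Count a new victim (models mark_tsk_oom_victim). */
static void mark_oom_victim(void)
{
	atomic_fetch_add(&oom_victims, 1);
}

/*
 * Uncount a victim (models unmark_oom_victim). Returns true for the
 * last victim, which is the point where waiters would be woken up.
 */
static bool unmark_oom_victim(void)
{
	int old = atomic_fetch_sub(&oom_victims, 1);

	assert(old > 0);
	return old == 1;
}

/* Returns false when the OOM killer is disabled; the caller must cope. */
static bool out_of_memory(void)
{
	if (atomic_load(&oom_killer_disabled))
		return false;
	mark_oom_victim();
	return true;
}

/*
 * Hypothetical non-blocking variant of oom_killer_disable(): forbid new
 * OOM kills, then report whether all earlier victims have finished.
 */
static bool oom_killer_disable_poll(void)
{
	atomic_store(&oom_killer_disabled, true);
	return atomic_load(&oom_victims) == 0;
}
```

The barrier property comes from the ordering: new kills are refused first, and only then is the victim count examined, so the count can only fall once the flag is set. In the kernel the flag check and the kill are additionally serialized by taking oom_sem for reading, which this model omits.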

sysrq: convert printk to pr_* equivalent · 401e4a7c

Authored by Michal Hocko on Feb 11, 2015

While touching this area let's convert printk to pr_*.  This also makes
the printing of continuation lines done properly.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

401e4a7c

oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60

Authored by Michal Hocko on Feb 11, 2015

This patchset addresses a race which was described in the changelog for
5695be14 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen.  But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situtation.  In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen.  This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.

The original patch hasn't closed the race window completely because that
would require a more complex solution as it can be seen by this patchset.

The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_.  As Tejun pointed out, the more unlikely the
race condition is, the harder it would be to debug weird bugs deep in the
PM freezer, where the debugging options are reduced considerably.  I can
only speculate what might happen when a task is still runnable
unexpectedly.

On a plus side and as a side effect the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.

I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
	echo processors > /sys/power/pm_test
	while true
	do
		echo mem > /sys/power/state
		sleep 1s
	done
- simple module which allocates and frees 20M in 8K chunks. If it sees
  freezing(current) then it tries another round of allocation before calling
  try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
  and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
  it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
  in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
  I think this should be OK because __thaw_task shouldn't interfere with any
  locking down wake_up_process. Oleg?

As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled.  I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.

[  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[  243.636072] (elapsed 2.837 seconds) done.
[  243.641985] Trying to disable OOM killer
[  243.643032] Waiting for concurent OOM victims
[  243.644342] OOM killer disabled
[  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[  243.652983] Suspending console(s) (use no_console_suspend to debug)
[  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[  243.992600] PM: suspend of devices complete after 336.667 msecs
[  243.993264] PM: late suspend of devices complete after 0.660 msecs
[  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[  243.994717] ACPI: Preparing to enter system sleep state S3
[  243.994795] PM: Saving platform NVS memory
[  243.994796] Disabling non-boot CPUs ...

The first 2 patches are simple cleanups for OOM.  They should go in
regardless of the rest IMO.

Patches 3 and 4 are trivial printk -> pr_info conversions and they should
go in as well.

The main patch is the last one and I would appreciate acks from Tejun and
Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would be appreciated) but I am not so
sure I haven't screwed anything in the freezer code.  I have found several
surprises there.

This patch (of 5):

This patch is just preparatory and it doesn't introduce any functional
change.

Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent new kills. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

49550b60

xilinx usb2 gadget: get rid of incredibly annoying compile warning · 7796c11c

Authored by Linus Torvalds on Feb 11, 2015

This one was driving me mad, with several lines of warnings during the
allmodconfig build for a single bogus pointer cast.  The warning was so
verbose due to the indirect macro expansion explanation, and the whole
thing was just for a debug printout.

The bogus pointer-to-integer cast was pointless anyway, so just remove
it, and use '%p' to show the pointer.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7796c11c
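
The fix in 7796c11c amounts to printing a pointer with '%p' rather than casting it to an integer type first. A minimal userspace illustration follows; struct req and format_req() are hypothetical and have nothing to do with the Xilinx driver's own data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical request object, used only to have something to point at. */
struct req {
	int id;
};

/* Format a debug line; the pointer goes through %p, not an integer cast. */
static int format_req(char *buf, size_t len, const struct req *r)
{
	assert(buf != NULL && len > 0);
	/* %p expects a void pointer, so convert to void * explicitly. */
	return snprintf(buf, len, "req %d at %p", r->id, (void *)r);
}
```

A cast such as (u32)ptr warns on 64-bit builds because the pointer does not fit in the integer, and inside a logging macro the compiler repeats the warning with the whole macro expansion, which is why a single cast can produce several lines of noise.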

11 Feb, 2015 2 commits

sata_dwc_460ex: disable COMPILE_TEST again · 06cc01a0

Authored by Linus Torvalds on Feb 10, 2015

Commit 84683a7e ("sata_dwc_460ex: enable COMPILE_TEST for the
driver") enabled this driver for non-ppc460-ex platforms, but it was
then disabled for ARM and ARM64 by commit 2de5a9c0 ("sata_dwc_460ex:
disable compilation on ARM and ARM64") because it's too noisy and
broken.

This disables it entirely, because it's too noisy on x86-64 too, and
there's no point in disabling architectures one by one.  At a minimum,
the code isn't 64-bit clean, and even on 32-bit it is questionable
whether it makes sense.

Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

06cc01a0

mm: remove rest usage of VM_NONLINEAR and pte_file() · 0661a336

Authored by Kirill A. Shutemov on Feb 10, 2015

One bit in ->vm_flags is unused now!

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0661a336

10 Feb, 2015 9 commits

ARM: 8256/1: driver coamba: add device binding path 'driver_override' · 3cf38571

Authored by Antonios Motakis on Jan 06, 2015

As already demonstrated with PCI [1] and the platform bus [2], a
driver_override property in sysfs can be used to bypass the id
matching of a device to an AMBA driver. This can be used by VFIO to
bind to any AMBA device requested by the user.

[1] http://lists-archives.com/linux-kernel/28030441-pci-introduce-new-device-binding-path-using-pci_dev-driver_override.html
[2] https://www.redhat.com/archives/libvir-list/2014-April/msg00382.html

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Reviewed-by: Kim Phillips <kim.phillips@freescale.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

3cf38571

i40e: Fix for stats init function call in Rx setup · f217d6ca

Authored by Carolyn Wyborny on Feb 09, 2015

This patch fixes an indentation issue and an error in an argument,
both reported by static analysis.  Without this patch, sparse and other
static analysis tools will report errors.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f217d6ca

Merge branch 'pci/host-generic' of... · 5c493df2

Authored by Rafael J. Wysocki on Feb 09, 2015

Merge branch 'pci/host-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into acpi-resources

modified: drivers/of/of_pci.c

This fixes a build failure after merging the 'acpi-resources' branch
with the PCI tree caused by bad interactions between that branch and
the only commit in 'pci/host-generic'. Also that commit contains a
bug which can be fixed by removing one line of code, so do that too.

Link: http://marc.info/?l=linux-kernel&m=142344882101429&w=2
Link: http://marc.info/?l=linux-next&m=142346304003932&w=2
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

5c493df2

net: Mellanox: Delete unnecessary checks before the function call "vunmap" · 1d966d03

Authored by Markus Elfring on Feb 09, 2015

The vunmap() function also performs input parameter validation.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d966d03

cxgb4: Add support in cxgb4 to get expansion rom version via ethtool · ba3f8cd5

Authored by Hariprasad Shenai on Feb 09, 2015

Add support to get option/expansion rom version flashed in the adapter via
ethtool getdrvinfo function.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ba3f8cd5

IB/mlx4: Reset flow support for IB kernel ULPs · 35f05dab

Authored by Yishai Hadas on Feb 08, 2015

The driver exposes interfaces that directly relate to HW state. Upon fatal
error, consumers of these interfaces (ULPs) that rely on completion of
all their posted work-request could hang, thereby introducing dependencies
in shutdown order. To prevent this from happening, we manage the
relevant resources (CQs, QPs) that are used by the device. Upon a fatal error,
we now generate simulated completions for outstanding WQEs that were not
completed at the time the HW was reset.

It includes invoking the completion event handler for all involved CQs so that
the ULPs will poll those CQs. When polled we return simulated CQEs with
IB_WC_WR_FLUSH_ERR return code enabling ULPs to clean up their resources and
not wait forever for completions upon receiving remove_one.

The above change requires an extra check in the data path to make sure that
when the device is in an error state, the simulated CQEs will be returned and
no further WQEs will be posted.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35f05dab

IB/mlx4: Always use the correct port for mirrored multicast attachments · 824c25c1

Authored by Moni Shoua on Feb 08, 2015

When attaching a QP to a multicast address in bonded mode, there was an
assumption that the port of the QP must be #1. This assumption isn't the
case under the flow which enables maximal usage of the physical ports.

Fix it by always checking the port of the original flow and create the
mirrored flow on the other port.

Fixes: c6215745 ('IB/mlx4: Load balance ports in port aggregation mode')
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

824c25c1

net/bonding: Fix potential bad memory access during bonding events · 92e584fe

Authored by Moni Shoua on Feb 08, 2015

When queuing work to send the NETDEV_BONDING_INFO netdev event, it's
possible that when the work is executed, the pointer to the slave
becomes invalid. This can happen if between queuing the event and the
execution of the work, the net-device was un-enslaved and re-enslaved.

Fix that by queuing a work with the data of the slave instead of the
slave structure.

Fixes: 69e61133 ('net/bonding: Notify state change on slaves')
Reported-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

92e584fe

random: Fix fast_mix() function · 19acc77a

Authored by George Spelvin on Feb 07, 2015

There was a bad typo in commit 43759d4f ("random: use an improved
fast_mix() function") and I didn't notice because it "looked right", so
I saw what I expected to see when I reviewed it.

Only months later did I look and notice it's not the Threefish-inspired
mix function that I had designed and optimized.

Mea Culpa.  Each input bit still has a chance to affect each output bit,
and the fast pool is spilled *long* before it fills, so it's not a total
disaster, but it's definitely not the intended great improvement.

I'm still working on finding better rotation constants.  These are good
enough, but since it's unrolled twice, it's possible to get better
mixing for free by using eight different constants rather than repeating
the same four.

Signed-off-by: George Spelvin <linux@horizon.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org  # v3.16+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

19acc77a
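
The structure that 19acc77a describes (an add-rotate-xor mix over four 32-bit words, unrolled twice with the same four rotation constants) can be sketched as below. This is an illustration rather than a copy of drivers/char/random.c: fast_mix_sketch() and mix_round() are hypothetical names, and the rotation constants 6, 27, 16 and 14 are assumptions that should be checked against the kernel source before being relied on.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by s bits; s must be in the range 1..31. */
static inline uint32_t rol32(uint32_t w, unsigned s)
{
	return (w << s) | (w >> (32 - s));
}

/* One add-rotate-xor step: each word rotates itself, then mixes across. */
static void mix_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
		      unsigned r1, unsigned r2)
{
	*a += *b;		*c += *d;
	*b = rol32(*b, r1);	*d = rol32(*d, r2);
	*d ^= *a;		*b ^= *c;
}

/* Two passes over the same pair of rounds, i.e. "unrolled twice". */
static void fast_mix_sketch(uint32_t pool[4])
{
	uint32_t a = pool[0], b = pool[1];
	uint32_t c = pool[2], d = pool[3];

	for (int i = 0; i < 2; i++) {
		mix_round(&a, &b, &c, &d, 6, 27);	/* assumed constants */
		mix_round(&a, &b, &c, &d, 16, 14);	/* assumed constants */
	}

	pool[0] = a;	pool[1] = b;
	pool[2] = c;	pool[3] = d;
}
```

Every step is invertible (an addition, a rotation of a word by itself, an xor), so the mix is a permutation of the 128-bit state. The remark about "eight different constants" in the message would correspond to giving each of the four mix_round() calls its own pair instead of repeating two pairs.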

09 Feb, 2015 18 commits

i40e/i40evf: Add call to u64_stats_init to init · 638702bd

Authored by Carolyn Wyborny on Jan 24, 2015

This patch adds a call to u64_stats_init to Rx setup.
This is done in order to avoid lockdep errors with seqcount on newer kernels.

Change-ID: Ia8ba8f0bcbd1c0e926f97d70aeee4ce4fd055e93
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

638702bd

i40e: Enable Loopback for the FCOE vsi as well · 9230165f

Authored by Anjali Singhai Jain on Jan 24, 2015

For all VSIs on a VEB, Loopback mode should be either on or off.
Our configuration requires them to be ON so that VSIs can directly
talk to each other without going out on the wire.

Change-ID: I77b8310bc846329972b13b185949ab1431a46c30
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

9230165f

i40e: use dev_port for fcoe netdev · 4d48b566

Authored by Vasu Dev on Jan 24, 2015

Set a different dev_port value of 1 for the FCoE netdev than the default
zero dev_port value for the PF netdev. This helps the biosdevname user tool
to differentiate them correctly while both are attached to the same PCI
function.

Change-ID: I8fb90e4ef52a1242f7580e49a3f0918735aee8ef
Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

4d48b566

i40e: Fix function header · 03147773

Authored by Greg Rose on Jan 24, 2015

s/enable/disable

Change-ID: Ic0572a6c59d03e05a0a35d2e2e9d532e0512638d
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

03147773

i40e: fix led blink toggle to enable steady state · 9be00d67

Authored by Matt Jared on Jan 24, 2015

Make sure to clear the GPIO blink field, instead of OR'ing against zero
if the field is already '1'.

Change-ID: Ie52a52abd48f6f52b20778a6b8b0c542dfc9245c
Signed-off-by: Matt Jared <matthew.a.jared@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

9be00d67
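
The bug class fixed in 9be00d67 is easy to reproduce in isolation: OR'ing a zero into a register field that is already 1 leaves it at 1, so the field has to be cleared before the new value is written. The sketch below uses a hypothetical register layout; BLINK_SHIFT and BLINK_MASK are made-up names and values, not the i40e GPIO register definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a one-bit blink field at bit 11. */
#define BLINK_SHIFT	11
#define BLINK_MASK	(UINT32_C(1) << BLINK_SHIFT)

/* Buggy variant: OR alone can set the field but never clear it. */
static uint32_t set_blink_buggy(uint32_t reg, int blink)
{
	return reg | ((uint32_t)(blink ? 1 : 0) << BLINK_SHIFT);
}

/* Fixed variant: clear the field first, then write the new value. */
static uint32_t set_blink(uint32_t reg, int blink)
{
	reg &= ~BLINK_MASK;
	if (blink)
		reg |= BLINK_MASK;
	return reg;
}
```

With the buggy helper, a request for steady state (blink == 0) on a register that already has the blink bit set returns the register unchanged, which matches the symptom named in the commit title.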

i40evf: Force Tx writeback on ITR · c29af37f

Authored by Anjali Singhai Jain on Jan 10, 2015

This patch forces Tx descriptor writebacks on ITR by kicking
off the SWINT interrupt when we notice that there are non-cache-aligned
Tx descriptors waiting in the ring while interrupts are disabled
under NAPI.

Change-ID: dd6d9675629bf266c7515ad7a201394618c35444
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

c29af37f

i40e: stop the service task at shutdown · 88086e5d

Authored by Mitch Williams on Jan 09, 2015

Stop the service task in the shutdown handler, preventing it from
accessing the admin queue after it had been closed. This fixes a panic
that could occur when the system was shut down with a lot of VFs
enabled.

Change-ID: I286735e3842de472385bbf7ad68d30331e508add
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

88086e5d

cxgb4: Fix trace observed while dumping clip_tbl · acde2c2d

Authored by Hariprasad Shenai on Feb 09, 2015

Handle the clip_tbl debugfs entry when clip_tbl isn't allocated.
In commit b5a02f50 ("cxgb4: Update ipv6 address handling api") a wrong
argument was passed to single_open for the clip_tbl debugfs entry, which
led to the trace below. Fix it.

======
call Trace:
 [<ffffffffa073c606>] clip_tbl_open+0x16/0x30 [cxgb4]
 [<ffffffff8119e2fa>] do_dentry_open+0x21a/0x370
 [<ffffffff8119e499>] vfs_open+0x49/0x50
 [<ffffffff811b0d0e>] do_last+0x21e/0x800
 [<ffffffff811b1382>] path_openat+0x92/0x470
 [<ffffffff8110569f>] ? rb_reserve_next_event+0xaf/0x380
 [<ffffffff8110569f>] ? rb_reserve_next_event+0xaf/0x380
 [<ffffffff811b189a>] do_filp_open+0x4a/0xa0
 [<ffffffff811bdc5d>] ? __alloc_fd+0xcd/0x140
 [<ffffffff8119fa4a>] do_sys_open+0x11a/0x230
 [<ffffffff8101219f>] ? syscall_trace_enter_phase2+0xaf/0x1b0
 [<ffffffff8119fb9e>] SyS_open+0x1e/0x20
 [<ffffffff815bf6f0>] tracesys_phase2+0xd4/0xd9
Code: 89 e5 66 66 66 66 90 48 8b 47 e0 48 8b 40 30 48 8b 40 58 c9 c3 66 0f 1f
84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 47 e0 <48> 8b 40 58 c9 c3 66
66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48
RIP  [<ffffffff8120898d>] PDE_DATA+0xd/0x20
 RSP <ffff8800b08c3c48>
CR2: 0000000000000058

=====

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

acde2c2d

i40evf: stop the watchdog for shutdown · 00293fdc

Authored by Mitch Williams on Jan 09, 2015

Stop the watchdog during shutdown. Failing to do this causes a log full
of admin queue errors and the occasional hang when the system is shut
down.

Change-ID: Ib2fd11213cca2fa589eb68577e86b1000c23c250
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

00293fdc

i40evf: ignore bogus messages from FW · 8b011ebb

Authored by Mitch Williams on Jan 09, 2015

Occasionally on shutdown, the FW will hand us a bunch of messages filled
with zeros, which can cause us to spin trying to handle them. Just
ignore these and get on with shutting down.

Change-ID: I347e9648f7153ad5a7b7e0847b87f7aad5f3e0da
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

8b011ebb

i40evf: reset on module unload · f4a71881

Authored by Mitch Williams on Jan 09, 2015

When the module is being unloaded, don't wait for the PF to politely
handle all of our admin queue requests, as that might take forever with
a lot of VFs enabled. Instead, just stop everything and request a VF
reset.

When the original shutdown code was written, VF resets were unreliable,
so we avoided them. But with production hardware and firmware, and the
1.x PF driver, this is no longer the case.

This fixes a potential multi-minute delay on driver unload, VF disable,
or system shutdown.

Change-ID: Ib43d6d860ef6b9b8f26e8dce0615a0302608c7d9
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

f4a71881

i40e: add locking around VF reset · 3ba9bcb4

Committed by Mitch Williams on Jan 09, 2015

During VF deallocation, we need to lock out the VF reset code. However,
we cannot depend on simply masking the interrupt, as this does not lock
out the service task, which can still call the reset routine. Instead,
leave the interrupt enabled, but add locking around the VF disable and
reset routines.

For the disable code, we wait to get the lock, as the reset code will
take a finite amount of time to run. For the reset code, we just return
if we fail to get the lock. Since we know that the VFs are being
disabled, we don't need to handle the reset.

This fixes a panic when disabling SR-IOV.

Change-ID: Iea0a6cdef35c331f48c6d5b2f8e6f0e86322e7d8
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

3ba9bcb4
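
As an illustration of the locking scheme described in "i40e: add locking around VF reset" above, here is a minimal userland sketch in C. It is not the driver's code: `struct pf_state`, `vf_busy`, `vf_reset()` and `vfs_disable()` are hypothetical names, and a C11 `atomic_flag` stands in for whatever lock the driver actually uses. The point is only the asymmetry: the disable path waits for the lock, while the reset path gives up if it cannot take it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PF state; not the i40e driver's structures. */
struct pf_state {
    atomic_flag vf_busy;     /* set while a VF disable or reset is running */
    int num_vfs;             /* number of allocated VFs */
    int resets_handled;      /* resets that actually ran */
};

/* Reset path: return at once if the lock is already held.
 * Returns true if the reset ran, false if it was skipped. */
bool vf_reset(struct pf_state *pf)
{
    if (atomic_flag_test_and_set(&pf->vf_busy))
        return false;        /* VFs are being disabled: nothing to reset */
    if (pf->num_vfs > 0)
        pf->resets_handled++;
    atomic_flag_clear(&pf->vf_busy);
    return true;
}

/* Disable path: wait until the lock is free, since a running reset
 * finishes in finite time. A real driver would sleep rather than spin. */
void vfs_disable(struct pf_state *pf)
{
    while (atomic_flag_test_and_set(&pf->vf_busy))
        ;
    pf->num_vfs = 0;
    atomic_flag_clear(&pf->vf_busy);
}
```

With this shape, a reset that races with a disable simply returns without touching VF state that is being torn down.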

i40e: Use even more ARQ descriptors · 07574897

Committed by Mitch Williams on Jan 09, 2015

When enabling 64 VFs and loading the VF driver in the host kernel, we
can easily overrun the PF's admin receive queue. Double the size of this
queue, and increase the work limit to allow the PF to handle more
requests in a single pass through the service task.

Change-ID: I0efbbdc61954bffad422a2f33c4b948a59370bf5
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

07574897

i40e: delay after VF reset · 1750a22f

Committed by Mitch Williams on Jan 09, 2015

Delay a minimum of 10ms after VF reset, to allow the hardware's internal
FIFOs to flush.

Change-ID: I8a02ddb28c9f0d7303a1eb21d0b2443e5b4c1cda
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

1750a22f

i40e: avoid use of uninitialized v_budget in i40e_init_msix · 83840e4b

Committed by John W Linville on Jan 14, 2015

This I40E_FCOE block increments v_budget before it has been initialized,
then v_budget gets overwritten a few lines later. This patch just
reorders the code hunks in what I believe was the intended sequence.

Coverity: CID 12600999
Signed-off-by: John W Linville <linville@tuxdriver.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

83840e4b
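
The ordering problem described in "i40e: avoid use of uninitialized v_budget in i40e_init_msix" above can be sketched as follows. This is not the driver's function: `msix_budget()` and its parameters are hypothetical, and only the name `v_budget` comes from the commit message. The sketch shows the corrected order, with the base value assigned before any optional increment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: computes a vector budget from a base value plus an
 * optional FCoE share. Parameter names are illustrative only. */
int msix_budget(int base_vectors, bool fcoe_enabled, int fcoe_vectors)
{
    int v_budget;

    /* Assign the base first. If the increment below ran before this line,
     * it would read an uninitialized variable and its effect would then
     * be overwritten by this assignment. */
    v_budget = base_vectors;

    if (fcoe_enabled)
        v_budget += fcoe_vectors;

    return v_budget;
}
```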

i40e: i40e_fcoe.c: Remove unused function · cf86da48

Committed by Rickard Strandqvist on Jan 07, 2015

Remove the function i40e_rx_is_fip() that is not used anywhere.

This was partially found by using a static code analysis program
called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

cf86da48

vxlan: Wrong type passed to %pIS · a4870f79

Committed by Rasmus Villemoes on Feb 07, 2015

src_ip is a pointer to a union vxlan_addr, one member of which is a
struct sockaddr. Passing a pointer to src_ip is wrong; one should pass
the value of src_ip itself. Since %pIS formally expects something of
type struct sockaddr*, let's pass a pointer to the appropriate union
member, though this of course doesn't change the generated code.

Fixes: e4c7ed41 ("vxlan: add ipv6 support")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a4870f79
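
To make the pointer distinction in "vxlan: Wrong type passed to %pIS" above concrete, here is a small userland sketch in C. It does not use the kernel's definitions: `union vxlan_addr_sketch` is a hypothetical mirror of the union the commit describes, and `family_of()` is a hypothetical stand-in for a consumer that, like %pIS, expects a `struct sockaddr *`.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical userland mirror of a union with a struct sockaddr member. */
union vxlan_addr_sketch {
    struct sockaddr_in  sin;
    struct sockaddr_in6 sin6;
    struct sockaddr     sa;
};

/* Stand-in for a %pIS-style consumer: it expects a struct sockaddr pointer
 * and reads the address family through it. */
int family_of(const struct sockaddr *sa)
{
    return sa->sa_family;
}

/* src_ip is already a pointer to the union. Passing &src_ip would hand the
 * consumer the address of the pointer variable itself, which is wrong.
 * Passing &src_ip->sa gives the expected type and the same address as
 * src_ip, since all union members start at the same location. */
int family_correct(const union vxlan_addr_sketch *src_ip)
{
    return family_of(&src_ip->sa);
}
```

Because `&src_ip->sa` and `src_ip` hold the same address, choosing the union member only satisfies the type expectation, which matches the commit's remark that the generated code does not change.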

Driver: Vmxnet3: Change the hex constant to its decimal equivalent · dd83829e

Committed by Shrikrishna Khare on Feb 06, 2015

The hex constant chosen for VMXNET3_REV1_MAGIC is offensive;
replace it with its decimal equivalent.

Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Reviewed-by: Shreyas Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd83829e
