提交 · a294c199455187d124b0760fa8f86c13cdaa4b25 · OpenHarmony / kernel_linux

30 3月, 2018 12 次提交

lightnvm: implement get log report chunk helpers · a294c199

由 Javier González 提交于 3月 30, 2018

The 2.0 spec provides a report chunk log page that can be retrieved
using the stangard nvme get log page. This replaces the dedicated
get/put bad block table in 1.2.

This patch implements the helper functions to allow targets retrieve the
chunk metadata using get log page. It makes nvme_get_log_ext available
outside of nvme core so that we can use it form lightnvm.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a294c199

lightnvm: make address conversions depend on generic device · 7100d50a

由 Javier González 提交于 3月 30, 2018

On address conversions, use the generic device, instead of the target
device. This allows to use conversions outside of the target's realm.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7100d50a

lightnvm: add support for 2.0 address format · 69471513

由 Javier González 提交于 3月 30, 2018

Add support for 2.0 address format. Also, align address bits for 1.2 and
2.0 to be able to operate on channel and luns without requiring a format
conversion. Use a generic address format for this purpose.

Also, convert the generic operations to the generic format in pblk.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

69471513

lightnvm: normalize geometry nomenclature · a40afad9

由 Javier González 提交于 3月 30, 2018

Normalize nomenclature for naming channels, luns, chunks, planes and
sectors as well as derivations in order to improve readability.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a40afad9

lightnvm: complete geo structure with maxoc* · 3f48021b

由 Javier González 提交于 3月 30, 2018

Complete the generic geometry structure with the maxoc and maxocpu
felds, present in the 2.0 spec. Also, expose them through sysfs.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3f48021b

lightnvm: add shorten OCSSD version in geo · f1d4e812

由 Javier González 提交于 3月 30, 2018

Create a shorten version to use in the generic geometry.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f1d4e812

lightnvm: add minor version to generic geometry · 3cb98f84

由 Javier González 提交于 3月 30, 2018

Separate the version between major and minor on the generic geometry and
represent it through sysfs in the 2.0 path. The 1.2 path only shows the
major version to preserve the existing user space interface.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3cb98f84

lightnvm: simplify geometry structure · e46f4e48

由 Javier González 提交于 3月 30, 2018

Currently, the device geometry is stored redundantly in the nvm_id and
nvm_geo structures at a device level. Moreover, when instantiating
targets on a specific number of LUNs, these structures are replicated
and manually modified to fit the instance channel and LUN partitioning.

Instead, create a generic geometry around nvm_geo, which can be used by
(i) the underlying device to describe the geometry of the whole device,
and (ii) instances to describe their geometry independently.
Signed-off-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e46f4e48

lightnvm: remove nvm_dev_ops->max_phys_sect · 89a09c56

由 Matias Bjørling 提交于 3月 30, 2018

The value of max_phys_sect is always static. Instead of
defining it in the nvm_dev_ops structure, declare it as a global
value.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

89a09c56

lightnvm: remove max_rq_size · af569398

由 Matias Bjørling 提交于 3月 30, 2018

The field is no longer used.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

af569398

lightnvm: add 2.0 geometry identification · 62771fe0

由 Matias Bjørling 提交于 3月 30, 2018

Implement the geometry data structures for 2.0 and enable a drive
to be identified as one, including exposing the appropriate 2.0
sysfs entries.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

62771fe0

lightnvm: flatten nvm_id_group into nvm_id · c6ac3f35

由 Matias Bjørling 提交于 3月 30, 2018

There are no groups in the 2.0 specification, make sure that the
nvm_id structure is flattened before 2.0 data structures are added.
Signed-off-by: NMatias Bjørling <mb@lightnvm.io>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c6ac3f35

28 3月, 2018 1 次提交

blk-mq: Allow PCI vector offset for mapping queues · f23f5bec

由 Keith Busch 提交于 3月 27, 2018

The PCI interrupt vectors intended to be associated with a queue may
not start at 0; a driver may allocate pre_vectors for special use. This
patch adds an offset parameter so blk-mq may find the intended affinity
mask and updates all drivers using this API accordingly.

Cc: Don Brace <don.brace@microsemi.com>
Cc: <qla2xxx-upstream@qlogic.com>
Cc: <linux-scsi@vger.kernel.org>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f23f5bec

26 3月, 2018 1 次提交

net/utils: Introduce inet_addr_is_any · a470143f

由 Sagi Grimberg 提交于 1月 24, 2018

Can be useful to check INET_ANY address for both ipv4/ipv6 addresses.
Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a470143f

22 3月, 2018 1 次提交

block: use 32-bit blk_status_t on Alpha · 6e2fb221

由 Mikulas Patocka 提交于 3月 21, 2018

Early alpha processors cannot write a single byte or word; they read 8
bytes, modify the value in registers and write back 8 bytes.

The type blk_status_t is defined as one byte, it is often written
asynchronously by I/O completion routines, this asynchronous modification
can corrupt content of nearby bytes if these nearby bytes can be written
simultaneously by another CPU.

- one example of such corruption is the structure dm_io where
  "blk_status_t status" is written by an asynchronous completion routine
  and "atomic_t io_count" is modified synchronously
- another example is the structure dm_buffer where "unsigned hold_count"
  is modified synchronously from process context and "blk_status_t
  write_error" is modified asynchronously from bio completion routine

This patch fixes the bug by changing the type blk_status_t to 32 bits if
we are on Alpha and if we are compiling for a processor that doesn't have
the byte-word-extension.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org	# 4.13+
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6e2fb221

18 3月, 2018 1 次提交

block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h> · 233bde21

由 Bart Van Assche 提交于 3月 14, 2018

It happens often while I'm preparing a patch for a block driver that
I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
available for this driver? Do I have to introduce definitions of these
constants before I can use these constants? To avoid this confusion,
move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
<linux/blkdev.h> header file such that these become available for all
block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
header file conditional to avoid that including that header file after
<linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE
redefinition.

Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
not been removed from uapi header files nor from NAND drivers in
which these constants are used for another purpose than converting
block layer offsets and sizes into a number of sectors.

Cc: David S. Miller <davem@davemloft.net>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

233bde21

17 3月, 2018 1 次提交

blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir() · 4c699480

由 Joseph Qi 提交于 3月 16, 2018

We've triggered a WARNING in blk_throtl_bio() when throttling writeback
io, which complains blkg->refcnt is already 0 when calling blkg_get(),
and then kernel crashes with invalid page request.
After investigating this issue, we've found it is caused by a race
between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
below:

writeback kworker               cgroup_rmdir
                                  cgroup_destroy_locked
                                    kill_css
                                      css_killed_ref_fn
                                        css_killed_work_fn
                                          offline_css
                                            blkcg_css_offline
  blkcg_bio_issue_check
    rcu_read_lock
    blkg_lookup
                                              spin_trylock(q->queue_lock)
                                              blkg_destroy
                                              spin_unlock(q->queue_lock)
    blk_throtl_bio
    spin_lock_irq(q->queue_lock)
    ...
    spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

Since rcu can only prevent blkg from releasing when it is being used,
the blkg->refcnt can be decreased to 0 during blkg_destroy() and schedule
blkg release.
Then trying to blkg_get() in blk_throtl_bio() will complains the WARNING.
And then the corresponding blkg_put() will schedule blkg release again,
which result in double free.
This race is introduced by commit ae118896 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, it will
lookup first and then try to lookup/create again with queue_lock. Since
revive this logic is a bit drastic, so fix it by only offlining pd during
blkcg_css_offline(), and move the rest destruction (especially
blkg_put()) into blkcg_css_free(), which should be the right way as
discussed.

Fixes: ae118896 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
Reported-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4c699480

14 3月, 2018 3 次提交

bsg: split handling of SCSI CDBs vs transport requeues · 17cb960f

由 Christoph Hellwig 提交于 3月 13, 2018

The current BSG design tries to shoe-horn the transport-specific
passthrough commands into the overall framework for SCSI passthrough
requests.  This has a couple problems:

 - each passthrough queue has to set the QUEUE_FLAG_SCSI_PASSTHROUGH flag
   despite not dealing with SCSI commands at all.  Because of that these
   queues could also incorrectly accept SCSI commands from in-kernel
   users or through the legacy SCSI_IOCTL_SEND_COMMAND ioctl.
 - the real SCSI bsg queues also incorrectly accept bsg requests of the
   BSG_SUB_PROTOCOL_SCSI_TRANSPORT type
 - the bsg transport code is almost unredable because it tries to reuse
   different SCSI concepts for its own purpose.

This patch instead adds a new bsg_ops structure to handle the two cases
differently, and thus solves all of the above problems.  Another side
effect is that the bsg-lib queues also don't need to embedd a
struct scsi_request anymore.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

17cb960f

bsg-lib: remove bsg_job.req · ef6fa64f

由 Christoph Hellwig 提交于 3月 13, 2018

Users of the bsg-lib interface should only use the bsg_job data structure
and not know about implementation details of it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBenjamin Block <bblock@linux.vnet.ibm.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ef6fa64f

bsg-lib: introduce a timeout field in struct bsg_job · 31156ec3

由 Christoph Hellwig 提交于 3月 13, 2018

The zfcp driver wants to know the timeout for a bsg job, so add a field
to struct bsg_job for it in preparation of not exposing the request
to the bsg-lib users.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBenjamin Block <bblock@linux.vnet.ibm.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

31156ec3

13 3月, 2018 2 次提交

direct-io: Remove unused DIO_SKIP_DIO_COUNT logic · ce3077ee

由 Nikolay Borisov 提交于 2月 23, 2018

This flag was added by fe0f07d0 ("direct-io: only inc/deci
inode->i_dio_count for file systems") as means to optimise the atomic
modificaiton of the variable for blockdevices. However with the advent
of 542ff7bf ("block: new direct I/O implementation") it became
unused. So let's remove it.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ce3077ee

direct-io: Remove unused DIO_ASYNC_EXTEND flag · c8f4c36f

由 Nikolay Borisov 提交于 2月 23, 2018

This flag was added by 60392573 ("direct-io: add flag to allow aio
writes beyond i_size") to support XFS. However, with the rework of
XFS' DIO's path to use iomap in acdda3aa ("xfs: use iomap_dio_rw")
it became redundant. So let's remove it.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c8f4c36f

09 3月, 2018 4 次提交

block: Move the queue_flag_*() functions from a public into a private header file · 8a0ac14b

由 Bart Van Assche 提交于 3月 07, 2018

This patch helps to avoid that new code gets introduced in block drivers
that manipulates queue flags without holding the queue lock when that
lock should be held.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8a0ac14b

block: Complain if queue_flag_(set|clear)_unlocked() is abused · 1db2008b

由 Bart Van Assche 提交于 3月 07, 2018

Since it is not safe to use queue_flag_(set|clear)_unlocked()
without holding the queue lock after the sysfs entries for a
queue have been created, complain if this happens.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1db2008b

block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}() · 8814ce8a

由 Bart Van Assche 提交于 3月 07, 2018

Introduce functions that modify the queue flags and that protect
these modifications with the request queue lock. Except for moving
one wake_up_all() call from inside to outside a critical section,
this patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8814ce8a

block: Reorder the queue flag manipulation function definitions · 66f91322

由 Bart Van Assche 提交于 3月 07, 2018

Move the definition of queue_flag_clear_unlocked() up and move the
definition of queue_in_flight() down such that all queue flag
manipulation function definitions become contiguous.

This patch does not change any functionality.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

66f91322

01 3月, 2018 4 次提交

misc: rtsx: rename SG_END macro · f16ee7c7

由 Arnd Bergmann 提交于 3月 01, 2018

A change to the generic scatterlist code caused a conflict with
the rtsx card reader driver:

In file included from drivers/misc/cardreader/rtsx_pcr.c:32:
include/linux/rtsx_pci.h:40: error: "SG_END" redefined [-Werror]

This changes one instance of the driver to prefix SG_END and
related constants.

Fixes: 723fbf56 ("lib/scatterlist: Add SG_CHAIN and SG_END macros for LSB encodings")
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f16ee7c7

block: Add 'lock' as third argument to blk_alloc_queue_node() · 5ee0524b

由 Bart Van Assche 提交于 2月 28, 2018

This patch does not change any functionality.
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5ee0524b

sbitmap: use test_and_set_bit_lock()/clear_bit_unlock() · 4ace53f1

由 Omar Sandoval 提交于 2月 27, 2018

sbitmap_queue_get()/sbitmap_queue_clear() are used for
allocating/freeing a resource, so they should provide acquire/release
barrier semantics, respectively. sbitmap_get() currently contains a full
barrier, which is unnecessary, so use test_and_set_bit_lock() instead of
test_and_set_bit() (these are equivalent on x86_64). sbitmap_clear_bit()
does not imply any barriers, which is incorrect, as accesses of the
resource (e.g., request) could potentially get reordered to after the
clear_bit(). Introduce sbitmap_clear_bit_unlock() and use it for
sbitmap_queue_clear() (this only adds a compiler barrier on x86_64). The
other existing user of sbitmap_clear_bit() (the blk-mq software queue
pending map) is serialized through a spinlock and does not need this.
Reported-by: NTejun Heo <tj@kernel.org>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4ace53f1

lib/scatterlist: Add SG_CHAIN and SG_END macros for LSB encodings · 723fbf56

由 Anshuman Khandual 提交于 2月 15, 2018

This replaces scatterlist->page_link LSB encodings with SG_CHAIN and
SG_END definitions without any functional change.
Signed-off-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

723fbf56

27 2月, 2018 3 次提交

genhd: Fix BUG in blkdev_open() · 56c0908c

由 Jan Kara 提交于 2月 26, 2018

When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:

CPU0				CPU1			CPU2
							del_gendisk()
							  bdev_unhash_inode(part1);

blkdev_open(part1, O_EXCL)	blkdev_open(part1, O_EXCL)
  bdev = bd_acquire()		  bdev = bd_acquire()
  blkdev_get(bdev)
    bd_start_claiming(bdev)
      - finds old inode 'whole'
      bd_prepare_to_claim() -> 0
							  bdev_unhash_inode(whole);
							<device removed>
							<new device under same
							 number created>
				  blkdev_get(bdev);
				    bd_start_claiming(bdev)
				      - finds new inode 'whole'
				      bd_prepare_to_claim()
					- this also succeeds as we have
					  different 'whole' here...
					- bad things happen now as we
					  have two exclusive openers of
					  the same bdev

The problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.

We fix the problem by introducing new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk() and furthermore by
making sure that get_gendisk() does not return gendisk that is being (or
has been) deleted. This makes sure that once we ever manage to look up
newly created bdev inode, we are also guaranteed that following
get_gendisk() will either return failure (and we fail open) or it
returns gendisk for the new device and following bdget_disk() will
return new bdev inode (i.e., blkdev_open() follows the path as if it is
completely run after new device is created).
Reported-and-analyzed-by: NHou Tao <houtao1@huawei.com>
Tested-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

56c0908c

genhd: Add helper put_disk_and_module() · 9df6c299

由 Jan Kara 提交于 2月 26, 2018

Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9df6c299

genhd: Rename get_disk() to get_disk_and_module() · 3079c22e

由 Jan Kara 提交于 2月 26, 2018

Rename get_disk() to get_disk_and_module() to make sure what the
function does. It's not a great name but at least it is now clear that
put_disk() is not it's counterpart.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3079c22e

23 2月, 2018 2 次提交

efivarfs: Limit the rate for non-root to read files · bef3efbe

由 Luck, Tony 提交于 2月 22, 2018

Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).

On X86 these EFI calls result in broadcast system management
interrupts (SMI) which affect performance of the whole system.
A malicious user can loop performing reads from efivarfs bringing
the system to its knees.

Linus suggested per-user rate limit to solve this.

So we add a ratelimit structure to "user_struct" and initialize
it for the root user for no limit. When allocating user_struct for
other users we set the limit to 100 per second. This could be used
for other places that want to limit the rate of some detrimental
user action.

In efivarfs if the limit is exceeded when reading, we take an
interruptible nap for 50ms and check the rate limit again.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bef3efbe

kconfig.h: Include compiler types to avoid missed struct attributes · 28128c61

由 Kees Cook 提交于 2月 22, 2018

The header files for some structures could get included in such a way
that struct attributes (specifically __randomize_layout from path.h) would
be parsed as variable names instead of attributes. This could lead to
some instances of a structure being unrandomized, causing nasty GPFs, etc.

This patch makes sure the compiler_types.h header is included in
kconfig.h so that we've always got types and struct attributes defined,
since kconfig.h is included from the compiler command line.
Reported-by: NPatrick McLean <chutzpah@gentoo.org>
Root-caused-by: NMaciej S. Szmigiero <mail@maciej.szmigiero.name>
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Tested-by: NMaciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: 3859a271 ("randstruct: Mark various structs for randomization")
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

28128c61

22 2月, 2018 5 次提交

bug.h: work around GCC PR82365 in BUG() · 173a3efd

由 Arnd Bergmann 提交于 2月 21, 2018

Looking at functions with large stack frames across all architectures
led me discovering that BUG() suffers from the same problem as
fortify_panic(), which I've added a workaround for already.

In short, variables that go out of scope by calling a noreturn function
or __builtin_unreachable() keep using stack space in functions
afterwards.

A workaround that was identified is to insert an empty assembler
statement just before calling the function that doesn't return.  I'm
adding a macro "barrier_before_unreachable()" to document this, and
insert calls to that in all instances of BUG() that currently suffer
from this problem.

The files that saw the largest change from this had these frame sizes
before, and much less with my patch:

  fs/ext4/inode.c:82:1: warning: the frame size of 1672 bytes is larger than 800 bytes [-Wframe-larger-than=]
  fs/ext4/namei.c:434:1: warning: the frame size of 904 bytes is larger than 800 bytes [-Wframe-larger-than=]
  fs/ext4/super.c:2279:1: warning: the frame size of 1160 bytes is larger than 800 bytes [-Wframe-larger-than=]
  fs/ext4/xattr.c:146:1: warning: the frame size of 1168 bytes is larger than 800 bytes [-Wframe-larger-than=]
  fs/f2fs/inode.c:152:1: warning: the frame size of 1424 bytes is larger than 800 bytes [-Wframe-larger-than=]
  net/netfilter/ipvs/ip_vs_core.c:1195:1: warning: the frame size of 1068 bytes is larger than 800 bytes [-Wframe-larger-than=]
  net/netfilter/ipvs/ip_vs_core.c:395:1: warning: the frame size of 1084 bytes is larger than 800 bytes [-Wframe-larger-than=]
  net/netfilter/ipvs/ip_vs_ftp.c:298:1: warning: the frame size of 928 bytes is larger than 800 bytes [-Wframe-larger-than=]
  net/netfilter/ipvs/ip_vs_ftp.c:418:1: warning: the frame size of 908 bytes is larger than 800 bytes [-Wframe-larger-than=]
  net/netfilter/ipvs/ip_vs_lblcr.c:718:1: warning: the frame size of 960 bytes is larger than 800 bytes [-Wframe-larger-than=]
  drivers/net/xen-netback/netback.c:1500:1: warning: the frame size of 1088 bytes is larger than 800 bytes [-Wframe-larger-than=]

In case of ARC and CRIS, it turns out that the BUG() implementation
actually does return (or at least the compiler thinks it does),
resulting in lots of warnings about uninitialized variable use and
leaving noreturn functions, such as:

  block/cfq-iosched.c: In function 'cfq_async_queue_prio':
  block/cfq-iosched.c:3804:1: error: control reaches end of non-void function [-Werror=return-type]
  include/linux/dmaengine.h: In function 'dma_maxpq':
  include/linux/dmaengine.h:1123:1: error: control reaches end of non-void function [-Werror=return-type]

This makes them call __builtin_trap() instead, which should normally
dump the stack and kill the current process, like some of the other
architectures already do.

I tried adding barrier_before_unreachable() to panic() and
fortify_panic() as well, but that had very little effect, so I'm not
submitting that patch.

Vineet said:

: For ARC, it is double win.
:
: 1. Fixes 3 -Wreturn-type warnings
:
: | ../net/core/ethtool.c:311:1: warning: control reaches end of non-void function
: [-Wreturn-type]
: | ../kernel/sched/core.c:3246:1: warning: control reaches end of non-void function
: [-Wreturn-type]
: | ../include/linux/sunrpc/svc_xprt.h:180:1: warning: control reaches end of
: non-void function [-Wreturn-type]
:
: 2.  bloat-o-meter reports code size improvements as gcc elides the
:    generated code for stack return.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
Link: http://lkml.kernel.org/r/20171219114112.939391-1-arnd@arndb.deSigned-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
Tested-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc]
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christopher Li <sparse@chrisli.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

173a3efd

mm, mlock, vmscan: no more skipping pagevecs · 9c4e6b1a

由 Shakeel Butt 提交于 2月 21, 2018

When a thread mlocks an address space backed either by file pages which
are currently not present in memory or swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU, the mlock
syscall will then sets the PageMlocked() bit of the page but will not be
able to put that page in unevictable LRU as the page is on the pagevec
of a different CPU.  Even on drain, that page will go to evictable LRU
because the PageMlocked() bit is not checked on pagevec drain.

The page will eventually go to right LRU on reclaim but the LRU stats
will remain skewed for a long time.

This patch puts all the pages, even unevictable, to the pagevecs and on
the drain, the pages will be added on their LRUs correctly by checking
their evictability.  This resolves the mlocked pages on pagevec of other
CPUs issue because when those pagevecs will be drained, the mlocked file
pages will go to unevictable LRU.  Also this makes the race with munlock
easier to resolve because the pagevec drains happen in LRU lock.

However there is still one place which makes a page evictable and does
PageLRU check on that page without LRU lock and needs special attention.
TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().

	#0: __pagevec_lru_add_fn	#1: clear_page_mlock

	SetPageLRU()			if (!TestClearPageMlocked())
					  return
	smp_mb() // <--required
					// inside does PageLRU
	if (!PageMlocked())		if (isolate_lru_page())
	  move to evictable LRU		  putback_lru_page()
	else
	  move to unevictable LRU

In '#1', TestClearPageMlocked() provides full memory barrier semantics
and thus the PageLRU check (inside isolate_lru_page) can not be
reordered before it.

In '#0', without explicit memory barrier, the PageMlocked() check can be
reordered before SetPageLRU().  If that happens, '#0' can put a page in
unevictable LRU and '#1' might have just cleared the Mlocked bit of that
page but fails to isolate as PageLRU fails as '#0' still hasn't set
PageLRU bit of that page.  That page will be stranded on the unevictable
LRU.

There is one (good) side effect though.  Without this patch, the pages
allocated for System V shared memory segment are added to evictable LRUs
even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
put such pages to unevictable LRU.

Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9c4e6b1a

mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats · c3cc3911

由 Johannes Weiner 提交于 2月 21, 2018

After commit a983b5eb ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"), we observed slowly upward creeping NR_WRITEBACK
counts over the course of several days, both the per-memcg stats as well
as the system counter in e.g.  /proc/meminfo.

The conversion from full per-cpu stat counts to per-cpu cached atomic
stat counts introduced an irq-unsafe RMW operation into the updates.

Most stat updates come from process context, but one notable exception
is the NR_WRITEBACK counter.  While writebacks are issued from process
context, they are retired from (soft)irq context.

When writeback completions interrupt the RMW counter updates of new
writebacks being issued, the decs from the completions are lost.

Since the global updates are routed through the joint lruvec API, both
the memcg counters as well as the system counters are affected.

This patch makes the joint stat and event API irq safe.

Link: http://lkml.kernel.org/r/20180203082353.17284-1-hannes@cmpxchg.org
Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Debugged-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NRik van Riel <riel@surriel.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c3cc3911

Kbuild: always define endianess in kconfig.h · 101110f6

由 Arnd Bergmann 提交于 2月 21, 2018

Build testing with LTO found a couple of files that get compiled
differently depending on whether asm/byteorder.h gets included early
enough or not.  In particular, include/asm-generic/qrwlock_types.h is
affected by this, but there are probably others as well.

The symptom is a series of LTO link time warnings, including these:

    net/netlabel/netlabel_unlabeled.h:223: error: type of 'netlbl_unlhsh_add' does not match original declaration [-Werror=lto-type-mismatch]
     int netlbl_unlhsh_add(struct net *net,
    net/netlabel/netlabel_unlabeled.c:377: note: 'netlbl_unlhsh_add' was previously declared here

    include/net/ipv6.h:360: error: type of 'ipv6_renew_options_kern' does not match original declaration [-Werror=lto-type-mismatch]
     ipv6_renew_options_kern(struct sock *sk,
    net/ipv6/exthdrs.c:1162: note: 'ipv6_renew_options_kern' was previously declared here

    net/core/dev.c:761: note: 'dev_get_by_name_rcu' was previously declared here
     struct net_device *dev_get_by_name_rcu(struct net *net, const char *name)
    net/core/dev.c:761: note: code may be misoptimized unless -fno-strict-aliasing is used

    drivers/gpu/drm/i915/i915_drv.h:3377: error: type of 'i915_gem_object_set_to_wc_domain' does not match original declaration [-Werror=lto-type-mismatch]
     i915_gem_object_set_to_wc_domain(struct drm_i915_gem_object *obj, bool write);
    drivers/gpu/drm/i915/i915_gem.c:3639: note: 'i915_gem_object_set_to_wc_domain' was previously declared here

    include/linux/debugfs.h:92:9: error: type of 'debugfs_attr_read' does not match original declaration [-Werror=lto-type-mismatch]
     ssize_t debugfs_attr_read(struct file *file, char __user *buf,
    fs/debugfs/file.c:318: note: 'debugfs_attr_read' was previously declared here

    include/linux/rwlock_api_smp.h:30: error: type of '_raw_read_unlock' does not match original declaration [-Werror=lto-type-mismatch]
     void __lockfunc _raw_read_unlock(rwlock_t *lock) __releases(lock);
    kernel/locking/spinlock.c:246:26: note: '_raw_read_unlock' was previously declared here

    include/linux/fs.h:3308:5: error: type of 'simple_attr_open' does not match original declaration [-Werror=lto-type-mismatch]
     int simple_attr_open(struct inode *inode, struct file *file,
    fs/libfs.c:795: note: 'simple_attr_open' was previously declared here

All of the above are caused by include/asm-generic/qrwlock_types.h
failing to include asm/byteorder.h after commit e0d02285
("locking/qrwlock: Use 'struct qrwlock' instead of 'struct __qrwlock'")
in linux-4.15.

Similar bugs may or may not exist in older kernels as well, but there is
no easy way to test those with link-time optimizations, and kernels
before 4.14 are harder to fix because they don't have Babu's patch
series

We had similar issues with CONFIG_ symbols in the past and ended up
always including the configuration headers though linux/kconfig.h.  This
works around the issue through that same file, defining either
__BIG_ENDIAN or __LITTLE_ENDIAN depending on CONFIG_CPU_BIG_ENDIAN,
which is now always set on all architectures since commit 4c97a0c8
("arch: define CPU_BIG_ENDIAN for all fixed big endian archs").

Link: http://lkml.kernel.org/r/20180202154104.1522809-2-arnd@arndb.deSigned-off-by: NArnd Bergmann <arnd@arndb.de>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

101110f6

include/linux/sched/mm.h: re-inline mmdrop() · d34bc48f

由 Andrew Morton 提交于 2月 21, 2018

As Peter points out, Doing a CALL+RET for just the decrement is a bit silly.

Fixes: d70f2a14 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc")
Acked-by: NPeter Zijlstra (Intel) <peterz@infraded.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d34bc48f

OpenHarmony / kernel_linux 上一次同步 4 年多

OpenHarmony / kernel_linux
上一次同步 4 年多