1. 01 Jul, 2015 1 commit
  2. 27 Jun, 2015 2 commits
  3. 26 Jun, 2015 37 commits
    • arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952
      Ross Zwisler committed
      Based on an original patch by Ross Zwisler [1].
      
      Writes to persistent memory have the potential to be posted to cpu
      cache, cpu write buffers, and platform write buffers (memory controller)
      before being committed to persistent media.  Provide apis,
      memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
      pmem and assert that it is durable in PMEM (a persistent linear address
      range).  A '__pmem' attribute is added so sparse can track proper usage
      of pointers to pmem.
      
      This continues the status quo of pmem being x86 only for 4.2, but
      reworks to ioremap, and wider implementation of memremap() will enable
      other archs in 4.3.
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      [djbw: various reworks]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
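The ordering contract in the pmem api commit above can be pictured with a small userspace model. This is only an illustrative Python sketch, not kernel code: the `ToyPmem` class, its `posted` list, its `media` array, and `power_fail()` are hypothetical stand-ins for write buffers and persistent media. Only the method names memcpy_to_pmem() and wmb_pmem() come from the commit text. The model is deliberately pessimistic and assumes that nothing reaches media before the barrier.

```python
class ToyPmem:
    """Toy model: writes are posted first and only become durable after a barrier."""

    def __init__(self, size):
        self.media = bytearray(size)   # what would survive a power failure
        self.posted = []               # writes still sitting in caches/buffers

    def memcpy_to_pmem(self, offset, data):
        # The write is accepted but not yet guaranteed to be on media.
        if offset < 0 or offset + len(data) > len(self.media):
            raise ValueError("write outside the persistent range")
        self.posted.append((offset, bytes(data)))

    def wmb_pmem(self):
        # Drain every posted write to media, in program order.
        for offset, data in self.posted:
            self.media[offset:offset + len(data)] = data
        self.posted.clear()

    def power_fail(self):
        # Anything not yet drained is lost; return the surviving contents.
        self.posted.clear()
        return bytes(self.media)
```

In this model a caller that skips wmb_pmem() before a simulated power failure loses its data, which is the situation the new apis are meant to let drivers rule out.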
    • libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3
      Toshi Kani committed
      Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
      under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.
      
      An example of numa_node values on a 2-socket system with a single
      NVDIMM range on each socket is shown below.
        /sys/bus/nd/devices
        |-- btt0.0/numa_node:0
        |-- btt1.0/numa_node:1
        |-- btt1.1/numa_node:1
        |-- namespace0.0/numa_node:0
        |-- namespace1.0/numa_node:1
        |-- region0/numa_node:0
        |-- region1/numa_node:1
      
      These numa_node files are then linked under the block class of
      their device names.
        /sys/class/block/pmem0/device/numa_node:0
        /sys/class/block/pmem1s/device/numa_node:1
      
      This enables numactl(8) to accept 'block:' and 'file:' paths of
      pmem and btt devices as shown in the examples below.
        numactl --preferred block:pmem0 --show
        numactl --preferred file:/dev/pmem1s --show
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
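The numa_node attributes listed in the commit above can also be collected from userspace. The following Python sketch is an assumption-laden illustration: the helper name `read_numa_nodes` is made up, and it simply walks whatever directory it is given, defaulting to the /sys/bus/nd/devices path quoted in the commit message.

```python
from pathlib import Path


def read_numa_nodes(devices_dir="/sys/bus/nd/devices"):
    """Return {device_name: numa_node} for every device exposing numa_node."""
    nodes = {}
    root = Path(devices_dir)
    if not root.is_dir():
        return nodes
    for attr in sorted(root.glob("*/numa_node")):
        try:
            nodes[attr.parent.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Skip attributes that are unreadable or not a plain integer.
            continue
    return nodes


if __name__ == "__main__":
    # Prints lines in the same "name/numa_node:N" shape as the commit message.
    for name, node in read_numa_nodes().items():
        print(f"{name}/numa_node:{node}")
```

Devices without a numa_node file are simply left out of the result, so the helper returns an empty dict on a system with no NVDIMM bus.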
    • libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani committed
      ACPI NFIT table has System Physical Address Range Structure entries that
      describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
      set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its node ID,
      and set it to a new numa_node field of nd_region_desc, which is then
      conveyed to the nd_region device.
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • acpi: Add acpi_map_pxm_to_online_node() · 99759869
      Toshi Kani committed
      The kernel initializes CPU & memory's NUMA topology from ACPI
      SRAT table.  Some other ACPI tables, such as NFIT and DMAR, also
      contain proximity IDs for their device's NUMA topology.  This
      information can be used to improve performance of these devices.
      
      This patch introduces acpi_map_pxm_to_online_node(), which is
      similar to acpi_map_pxm_to_node(), but always returns an online
      node.  When the mapped node from a given proximity ID is offline,
      it looks up the node distance table and returns the nearest
      online node.
      
      ACPI device drivers, which are called after the NUMA initialization
      has completed in the kernel, can call this interface to obtain their
      device NUMA topology from ACPI tables.  Such drivers do not have to
      deal with offline nodes.  A node may be offline when a device
      proximity ID is unique, SRAT memory entry does not exist, or NUMA is
      disabled, e.g. "numa=off" on x86.
      
      This patch also moves the pxm range check from acpi_get_node() to
      acpi_map_pxm_to_node().
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
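The fallback rule described for acpi_map_pxm_to_online_node() (use the mapped node if it is online, otherwise the nearest online node by distance) can be sketched as below. This is a Python model under stated assumptions, not the kernel routine: it starts from an already-mapped node id rather than a proximity ID, the function name `map_pxm_to_online_node` and its arguments are hypothetical, and breaking ties toward the lowest node id is a choice made for this sketch.

```python
def map_pxm_to_online_node(node, online, distance):
    """Return `node` if it is online, else the nearest online node.

    node     -- node id the proximity ID mapped to (toy input)
    online   -- set of online node ids
    distance -- distance[a][b] is the cost between nodes a and b
    """
    if node in online:
        return node
    if not online:
        raise ValueError("no online node available")
    # Pick the online node with the smallest distance; ties go to the lowest id.
    return min(sorted(online), key=lambda n: distance[node][n])
```

A caller in this model never sees an offline node, which matches the stated goal that drivers "do not have to deal with offline nodes".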
    • libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820
      Dan Williams committed
      Upon detection of an unarmed dimm in a region, arrange for descendant
      BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
      "unarmed" via flags passed by platform firmware (NFIT).
      
      The flags in the NFIT memory device sub-structure indicate the state of
      the data on the nvdimm relative to its energy source or last "flush to
      persistence".  For the most part there is nothing the driver can do but
      advertise the state of these flags in sysfs and emit a message if
      firmware indicates that the contents of the device may be corrupted.
      However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
      the block devices incorporating that nvdimm to be marked read-only.
      This is a safe default as the data is still available and new writes are
      held off until the administrator either forces read-write mode, or the
      energy source becomes armed.
      
      A 'read_only' attribute is added to REGION devices to allow for
      overriding the default read-only policy of all descendant block devices.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
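The default policy in the commit above (an unarmed dimm makes descendant block devices read-only unless the region's 'read_only' attribute overrides it) can be written down as a tiny decision function. This Python sketch is hypothetical throughout: the function name, the boolean-per-dimm input, and the tri-state override are modelling assumptions, and how the real attribute interacts with already-registered devices is not covered.

```python
def namespace_is_read_only(dimm_armed, region_read_only_override=None):
    """Decide whether block devices under a region should be read-only.

    dimm_armed -- iterable of booleans, one per dimm in the region
    region_read_only_override -- None to keep the default policy, or a bool
        standing in for an administrator writing the region 'read_only' attribute
    """
    if region_read_only_override is not None:
        return region_read_only_override
    # Default policy: any unarmed dimm makes every descendant read-only.
    return not all(dimm_armed)
```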
    • pmem: flag pmem block devices as non-rotational · 0f51c4fa
      Dan Williams committed
      ...since they are effectively SSDs as far as userspace is concerned.
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: enable iostat · f0dc089c
      Dan Williams committed
      This is disabled by default as the overhead is prohibitive, but if the
      user takes the action to turn it on we'll oblige.
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • pmem: make_request cleanups · edc870e5
      Dan Williams committed
      Various cleanups:
      
      1/ Kill the BUG_ON since we've already told the block layer we don't
         support DISCARD on all these drivers.
      
      2/ Kill the 'rw' variable, no need to cache it.
      
      3/ Kill the local 'sector' variable.  bio_for_each_segment() is already
         advancing the iterator's sector number by the bio_vec length.
      
      4/ Kill the check for accessing past the end of device;
         generic_make_request_checks() already does that.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      [hch: kill access past end of the device check]
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pmem: fix up max_hw_sectors · 43d3fa3a
      Dan Williams committed
      There is no hardware limit to enforce on the size of the i/o that can be passed
      to an nvdimm block device, so set it to UINT_MAX.
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, blk: add support for blk integrity · fcae6957
      Vishal Verma committed
      Support multiple block sizes (sector + metadata) for nd_blk in the
      same way as done for the BTT. Add the idea of an 'internal' lbasize,
      which is properly aligned and padded, and store metadata in this space.
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
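The 'internal' lbasize idea from the commit above (sector plus metadata, padded to an aligned size) comes down to a round-up. The Python sketch below shows that arithmetic under assumptions: the 64-byte default alignment and the 512/4096 sector sizes are illustrative values chosen here, not figures taken from the commit message, and both helper names are hypothetical.

```python
def internal_lbasize(lbasize, alignment=64):
    """Round the external block size up to the next multiple of `alignment`."""
    if lbasize <= 0 or alignment <= 0:
        raise ValueError("sizes must be positive")
    return -(-lbasize // alignment) * alignment


def split_block(lbasize, sector_sizes=(4096, 512)):
    """Split an lbasize into (sector, metadata) using the largest sector that fits."""
    for sector in sorted(sector_sizes, reverse=True):
        if lbasize >= sector:
            return sector, lbasize - sector
    raise ValueError("lbasize smaller than any supported sector size")
```

With these assumed values, a 520-byte block is treated as a 512-byte sector plus 8 bytes of metadata and occupies 576 bytes internally, so the padding is where the metadata and any slack end up.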
    • libnvdimm, btt: add support for blk integrity · 41cd8b70
      Vishal Verma committed
      Support multiple block sizes (sector + metadata) using the blk integrity
      framework. This registers a new integrity template that defines the
      protection information tuple size based on the configured metadata size,
      and simply acts as a passthrough for protection information generated by
      another layer. The metadata is written to the storage as-is, and read back
      with each sector.
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • tools/testing/nvdimm: libnvdimm unit test infrastructure · 6bc75619
      Dan Williams committed
      'libnvdimm' is the first driver sub-system in the kernel to implement
      mocking for unit test coverage.  The nfit_test module gets built as an
      external module and arranges for external module replacements of nfit,
      libnvdimm, nd_pmem, and nd_blk.  These replacements use the linker
      --wrap option to redirect calls to ioremap() + request_mem_region() to
      custom defined unit test resources.  The end result is a fully
      functional nvdimm_bus, as far as userspace is concerned, but with the
      capability to perform otherwise destructive tests on emulated resources.
      
      Q: Why not use QEMU for this emulation?
      QEMU is not suitable for unit testing.  QEMU's role is to faithfully
      emulate the platform.  A unit test's role is to unfaithfully implement
      the platform with the goal of triggering bugs in the corners of the
      sub-system implementation.  As bugs are discovered in platforms, or the
      sub-system itself, the unit tests are extended to backstop a fix with a
      reproducer unit test.
      
      Another problem with QEMU is that it would require coordination of 3
      software projects instead of 2 (kernel + libndctl [1]) to maintain and
      execute the tests.  The chances for bit rot and the difficulty of
      getting the tests running go up non-linearly the more components are
      involved.
      
      
      Q: Why submit this to the kernel tree instead of external modules in
         libndctl?
      Simple, to alleviate the same risk that out-of-tree external modules
      face.  Updates to drivers/nvdimm/ can be immediately evaluated to see if
      they have any impact on tools/testing/nvdimm/.
      
      
      Q: What are the negative implications of merging this?
      It is a unique maintenance burden because the purpose of mocking an
      interface to enable a unit test is to purposefully short circuit the
      semantics of a routine to enable testing.  For example
      __wrap_ioremap_cache() fakes the pmem driver into "ioremap()'ing" a test
      resource buffer allocated by dma_alloc_coherent().  The future
      maintenance burden hits when someone changes the semantics of
      ioremap_cache() and wonders what the implications are for the unit test.
      
      [1]: https://github.com/pmem/ndctl
      
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1
      Ross Zwisler committed
      The libnvdimm implementation handles allocating dimm address space (DPA)
      between PMEM and BLK mode interfaces.  After DPA has been allocated from
      a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
      as a struct bio based block device. Unlike PMEM, BLK is required to
      handle platform specific details like mmio register formats and memory
      controller interleave.  For this reason the libnvdimm generic nd_blk
      driver calls back into the bus provider to carry out the I/O.
      
      This initial implementation handles the BLK interface defined by the
      ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
      DCR (dimm control region), BDW (block data window), IDT (interleave
      descriptor) NFIT structures and the hardware register format.
      [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
      [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • nd_btt: atomic sector updates · 5212e11f
      Vishal Verma committed
      BTT stands for Block Translation Table, and is a way to provide power
      fail sector atomicity semantics for block devices that have the ability
      to perform byte granularity IO. It relies on the capability of libnvdimm
      namespace devices to do byte aligned IO.
      
      The BTT works as a stacked block device, and reserves a chunk of space
      from the backing device for its accounting metadata. It is a bio-based
      driver because all IO is done synchronously, and there is no queuing or
      asynchronous completions at either the device or the driver level.
      
      The BTT uses 'lanes' to index into various 'on-disk' data structures,
      and lanes also act as a synchronization mechanism in case there are more
      CPUs than available lanes. We did a comparison between two lane lock
      strategies - first where we kept an atomic counter around that tracked
      which was the last lane that was used, and 'our' lane was determined by
      atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
      theoretically, no CPU would be blocked waiting for a lane. The other
      strategy was to take the number of the cpu we're scheduled on and hash it
      to a lane number. Theoretically, this could block an IO that could've
      otherwise run using a different, free lane. But some fio workloads
      showed that the direct cpu -> lane hash performed faster than tracking
      'last lane' - my reasoning is the cache thrash caused by moving the
      atomic variable made that approach slower than simply waiting out the
      in-progress IO. This supports the conclusion that the driver can be a
      very simple bio-based one that does synchronous IOs instead of queuing.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      [jmoyer: fix nmi watchdog timeout in btt_map_init]
      [jmoyer: move btt initialization to module load path]
      [jmoyer: fix memory leak in the btt initialization path]
      [jmoyer: Don't overwrite corrupted arenas]
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
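The direct cpu -> lane strategy that the BTT commit above settled on can be modelled in a few lines. This Python sketch is a toy under stated assumptions: the `Lanes` class is hypothetical, a plain modulo stands in for the unspecified hash, and `threading.Lock` stands in for whatever per-lane lock the driver uses.

```python
import threading


class Lanes:
    """Toy model of the direct cpu -> lane hash with one lock per lane."""

    def __init__(self, nr_lanes):
        if nr_lanes <= 0:
            raise ValueError("need at least one lane")
        self.locks = [threading.Lock() for _ in range(nr_lanes)]

    def lane_for_cpu(self, cpu):
        # Direct hash: no shared counter is touched on the fast path.
        return cpu % len(self.locks)

    def acquire(self, cpu):
        lane = self.lane_for_cpu(cpu)
        # With nr_cpus > nr_lanes, cpus that hash to the same lane wait here.
        self.locks[lane].acquire()
        return lane

    def release(self, lane):
        self.locks[lane].release()
```

The design trade-off from the commit message is visible here: two cpus that map to the same lane serialize even if another lane is free, but no cpu has to bounce a shared atomic counter between caches to find its lane.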
    • dm cache policy smq: fix "default" version to be 1.4.0 · b5451e45
      Mike Snitzer committed
      Commit bccab6a0 ("dm cache: switch the "default" cache replacement
      policy from mq to smq") should've incremented the "default" policy's
      version number to 1.4.0 rather than reverting to version 1.0.0.
      Reported-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • Revert "block, dm: don't copy bios for request clones" · 78d8e58a
      Mike Snitzer committed
      This reverts commit 5f1b670d.
      
      Justification for revert as reported in this dm-devel post:
      https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html
      
      this change should not be pushed to mainline yet.
      
      Firstly, Christoph has a newer version of the patch that fixes silent
      data corruption problem:
        https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html
      
      And the new version still depends on LLDDs to always complete requests
      to the end when error happens, while block API doesn't enforce such a
      requirement. If the assumption is ever broken, the inconsistency between
      request and bio (e.g. rq->__sector and rq->bio) will cause silent data
      corruption:
        https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html
      Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • 4e6e36c3
    • drm/nouveau: Pause between setting gpu to D3hot and cutting the power · c5fd936e
      Lukas Wunner committed
      On the MacBook Pro, power of the gpu is cut by a gmux chip. Sometimes
      the gpu gets stuck in powersaving mode and refuses to wake up
      ("Refused to change power state, currently in D3"). Inserting a
      delay between setting the gpu to D3hot and cutting the power seems
      to help (most of the time). This issue and its (partial) remediation
      by the patch was observed with an Nvidia GT650M (NVE7 / GK107).
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • drivers/firmware/memmap.c: fix kernel-doc format · cbdc2810
      Michal Simek committed
      Fix kernel-doc format validation to be able to use kernel-doc script for
      checking it.
      Signed-off-by: Michal Simek <michal.simek@xilinx.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/md/md.c: use strreplace() · 90a9befb
      Rasmus Villemoes committed
      There's no point in starting over when we meet a '/'.  This also
      eliminates a stack variable and a little .text.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/core.c: use strreplace() · a29fd614
      Rasmus Villemoes committed
      This eliminates a little .text and avoids repeating the strchr call when
      we meet a '!' (which will happen at least once).
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
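The point of the two strreplace() conversions above is that one pass over the string is enough, with no need to restart a search after each hit. The Python sketch below models that single pass on a bytearray. It is an illustration only: the real helper operates on a C string, the replacement count returned here is a convenience of this sketch, and the device name in the usage note is made up.

```python
def strreplace(buf, old, new):
    """Replace every `old` byte in `buf` with `new`, in place, in one pass.

    buf is a bytearray; old and new are single-byte bytes objects.
    Returns the number of bytes replaced (a convenience of this sketch).
    """
    if len(old) != 1 or len(new) != 1:
        raise ValueError("old and new must be single bytes")
    count = 0
    for i, byte in enumerate(buf):
        if byte == old[0]:
            buf[i] = new[0]
            count += 1
    return count
```

For example, calling `strreplace(name, b"/", b"!")` on a hypothetical `bytearray(b"cciss/c0d0")` visits each byte exactly once, however many separators the name contains.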
    • netconsole: implement extended console support · e2f15f9a
      Tejun Heo committed
      printk logbuf keeps various metadata and optional key=value dictionary for
      structured messages, both of which are stripped when messages are handed
      to regular console drivers.
      
      It can be useful to have this metadata and dictionary available to
      netconsole consumers.  This obviously makes logging via netconsole more
      complete and the sequence number in particular is useful in environments
      where messages may be lost or reordered in transit - e.g.  when netconsole
      is used to collect messages in a large cluster where packets may have to
      travel congested hops to reach the aggregator.  The lost and reordered
      messages can easily be identified and handled accordingly using the
      sequence numbers.
      
      printk recently added extended console support which can be selected by
      setting CON_EXTENDED flag.  From console driver side, not much changes.
      The only difference is that the text passed to the write callback is
      formatted the same way as /dev/kmsg.
      
      This patch implements extended console support for netconsole which can be
      enabled by either prepending "+" to a netconsole boot param entry or
      echoing 1 to "extended" file in configfs.  When enabled, netconsole
      transmits extended log messages with headers identical to /dev/kmsg
      output.
      
      There's one complication due to message fragments.  netconsole limits the
      maximum message size to 1k and messages longer than that are split into
      multiple fragments.  As all extended console messages should carry
      matching headers and be uniquely identifiable, each extended message
      fragment carries full copy of the metadata and an extra header field to
      identify the specific fragment.  The optional header is of the form
      "ncfrag=OFF/LEN" where OFF is the byte offset into the message body and
      LEN is the total length.
      
      To avoid unnecessarily making printk format extended messages, extended
      netconsole is registered with printk only when the first extended
      netconsole is configured.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
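The fragmentation scheme in the netconsole commit above (every fragment repeats the header and adds "ncfrag=OFF/LEN") can be sketched from the receiver's point of view as well as the sender's. This Python model rests on assumptions: the header is treated as an opaque comma-separated field list terminated by ';', in the style of /dev/kmsg, the size limit is applied to the body only, and the names `fragment` and `reassemble` are hypothetical.

```python
def fragment(header, body, max_body):
    """Split `body` into fragments that each repeat `header` plus ncfrag=OFF/LEN."""
    if max_body <= 0:
        raise ValueError("max_body must be positive")
    total = len(body)
    if total <= max_body:
        return [header + b";" + body]          # fits: no ncfrag field needed
    frags = []
    for off in range(0, total, max_body):
        field = b",ncfrag=%d/%d" % (off, total)
        frags.append(header + field + b";" + body[off:off + max_body])
    return frags


def reassemble(frags):
    """Rebuild the body from fragments received in any order; None if incomplete."""
    total, parts = None, {}
    for frag in frags:
        head, _, chunk = frag.partition(b";")
        off, _, length = head.rsplit(b"ncfrag=", 1)[1].partition(b"/")
        total = int(length)
        parts[int(off)] = chunk
    body = b"".join(parts[o] for o in sorted(parts))
    return body if total is not None and len(body) == total else None
```

Because OFF is a byte offset and LEN is the total length, a collector in this model can detect a lost fragment (the pieces do not add up to LEN) and can reorder fragments without any extra state.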
    • netconsole: make all dynamic netconsoles share a mutex · 369e5a88
      Tejun Heo committed
      Currently, each dynamic netconsole_target uses its own separate mutex to
      synchronize the configuration operations.
      
      This patch replaces the per-netconsole_target mutexes with a single
      mutex - dynamic_netconsole_mutex.  The reduced granularity doesn't hurt
      anything, the code is minutely simpler and this'd allow adding
      operations which should be synchronized across all dynamic netconsoles.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netconsole: make netconsole_target->enabled a bool · 698cf1c6
      Tejun Heo committed
      netconsole uses both bool and int for boolean values.  Let's convert
      nt->enabled to bool for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netconsole: remove unnecessary netconsole_target_get/put() from write_msg() · a6d403ac
      Tejun Heo committed
      write_msg() grabs target_list_lock and walks target_list invoking
      netpoll_send_udp() on each target.  Curiously, it protects each iteration
      with netconsole_target_get/put() even though it never releases
      target_list_lock which protects all the members.
      
      While this doesn't harm anything, it doesn't serve any purpose either.
      The items on the list can't go away while target_list_lock is held.
      Remove the unnecessary get/put pair.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/misc/altera-stapl/altera.c: remove extraneous KERN_INFO prefix · 4ae555a5
      Colin Ian King committed
      The KERN_INFO prefix is being prepended to KERN_DEBUG when using the
      dprintk macro.  Remove it as it is extraneous since we are printing the
      message out as debug via dprintk().
      
      Fixes smatch warning:
      
      drivers/misc/altera-stapl/altera.c:2454 altera_init()
         warn: KERN_* level not at start of string
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Igor M. Liplianin <liplianin@netup.ru>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • clone: support passing tls argument via C rather than pt_regs magic · 3033f14a
      Josh Triplett committed
      clone has some of the quirkiest syscall handling in the kernel, with a
      pile of special cases, historical curiosities, and architecture-specific
      calling conventions.  In particular, clone with CLONE_SETTLS accepts a
      parameter "tls" that the C entry point completely ignores and some
      assembly entry points overwrite; instead, the low-level arch-specific
      code pulls the tls parameter out of the arch-specific register captured
      as part of pt_regs on entry to the kernel.  That's a massive hack, and
      it makes the arch-specific code only work when called via the specific
      existing syscall entry points; because of this hack, any new clone-like
      system call would have to accept an identical tls argument in exactly
      the same arch-specific position, rather than providing a unified system
      call entry point across architectures.
      
      The first patch allows architectures to handle the tls argument via
      normal C parameter passing, if they opt in by selecting
      HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
      into this.
      
      These two patches came out of the clone4 series, which isn't ready for
      this merge window, but these first two cleanup patches were entirely
      uncontroversial and have acks.  I'd like to go ahead and submit these
      two so that other architectures can begin building on top of this and
      opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
      send these through the next merge window (along with v3 of clone4) if
      anyone would prefer that.
      
      This patch (of 2):
      
      clone with CLONE_SETTLS accepts an argument to set the thread-local
      storage area for the new thread.  sys_clone declares an int argument
      tls_val in the appropriate point in the argument list (based on the
      various CLONE_BACKWARDS variants), but doesn't actually use or pass along
      that argument.  Instead, sys_clone calls do_fork, which calls
      copy_process, which calls the arch-specific copy_thread, and copy_thread
      pulls the corresponding syscall argument out of the pt_regs captured at
      kernel entry (knowing what argument of clone that architecture passes tls
      in).
      
      Apart from being awful and inscrutable, that also only works because only
      one code path into copy_thread can pass the CLONE_SETTLS flag, and that
      code path comes from sys_clone with its architecture-specific
      argument-passing order.  This prevents introducing a new version of the
      clone system call without propagating the same architecture-specific
      position of the tls argument.
      
      However, there's no reason to pull the argument out of pt_regs when
      sys_clone could just pass it down via C function call arguments.
      
      Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
      and a new copy_thread_tls that accepts the tls parameter as an additional
      unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
      argument to an unsigned long (which does not change the ABI), and pass
      that down to copy_thread_tls.
      
      Architectures that don't opt into copy_thread_tls will continue to ignore
      the C argument to sys_clone in favor of the pt_regs captured at kernel
      entry, and thus will be unable to introduce new versions of the clone
      syscall.
      
      Patch co-authored by Josh Triplett and Thiago Macieira.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
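The difference between the old and new tls plumbing in the clone commit above can be shown with a toy model. Everything in this Python sketch is hypothetical except the CLONE_SETTLS flag value: the architectures "arch_a" and "arch_b", their argument slots, and the list standing in for pt_regs are invented for illustration, and the functions only borrow the names copy_thread, copy_thread_tls, and sys_clone from the commit text.

```python
CLONE_SETTLS = 0x00080000  # flag value from the Linux uapi headers

# Hypothetical architectures: which syscall argument slot carries tls.
TLS_SLOT = {"arch_a": 3, "arch_b": 4}


def copy_thread(arch, clone_flags, regs):
    """Old style: dig tls out of the saved registers using arch knowledge."""
    if clone_flags & CLONE_SETTLS:
        return regs[TLS_SLOT[arch]]
    return None


def copy_thread_tls(clone_flags, tls):
    """New style: tls arrives as an ordinary argument, no register lookup."""
    return tls if clone_flags & CLONE_SETTLS else None


def sys_clone(arch, regs):
    """Entry point that knows its own argument order and passes tls down."""
    clone_flags = regs[0]
    return copy_thread_tls(clone_flags, regs[TLS_SLOT[arch]])
```

In this model only the entry point needs to know the argument order, so a new clone-like entry point with a different layout could call copy_thread_tls() directly, which is the flexibility the commit is after.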
    • Mohit Kumar has moved · 9c5dcdd0
      Pratyush Anand committed
      Mohit's email-id doesn't exist anymore as he has left the company.
      Replace ST's id with mohit.kumar.dhaka@gmail.com.
      Signed-off-by: Pratyush Anand <pratyush.anand@gmail.com>
      Cc: Mohit Kumar <mohit.kumar.dhaka@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Pratyush Anand has moved · e34cadde
      Pratyush Anand committed
      pratyush.anand@st.com email-id doesn't exist anymore as I have left the
      company.  Replace ST's id with pratyush.anand@gmail.com.
      Signed-off-by: Pratyush Anand <pratyush.anand@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: check comp algorithm availability earlier · d93435c3
      Sergey Senozhatsky committed
      Improvement idea by Marcin Jabrzyk.
      
      comp_algorithm_store() silently accepts any supplied algorithm name,
      because zram performs algorithm availability check later, during the
      device configuration phase in disksize_store() and emits the following
      error:
      
        "zram: Cannot initialise %s compressing backend"
      
      This error line is somewhat generic and, besides, can also indicate a
      failed attempt to allocate the compression backend's working buffers.

      Add an algorithm availability check to comp_algorithm_store():
      
        echo lzz > /sys/block/zram0/comp_algorithm
        -bash: echo: write error: Invalid argument
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Marcin Jabrzyk <m.jabrzyk@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: cut trailing newline in algorithm name · 4bbacd51
      Sergey Senozhatsky committed
      Supplied sysfs values sometimes contain a newline character (echo vs.
      echo -n), which we also copy into the compression algorithm name.  This
      works fine when we look up the compression algorithm, because we use
      sysfs_streq(), which takes care of trailing newlines.  However, it does
      not look nice when we print the compression algorithm name after
      zcomp_create() has failed:
      
       zram: Cannot initialise LXZ
                  compressing backend
      
      Cut the trailing newline, so the error string will look like
      
        zram: Cannot initialise LXZ compressing backend
      
      We can now also replace sysfs_streq() in zcomp_available_show() with
      strcmp().
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bbacd51
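
A minimal user-space sketch of the newline trimming described in this commit follows. The helper name copy_algorithm_name() and its signature are assumptions made for illustration; the driver's actual code may differ.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy `buf` into `dst` (capacity `dst_size`) and drop one trailing
 * newline, as left behind by `echo` without `-n`.
 * Returns the length of the stored name.
 */
static size_t copy_algorithm_name(char *dst, size_t dst_size, const char *buf)
{
	size_t len;

	if (dst_size == 0)
		return 0;

	/* Bounded copy; always leave room for the terminator. */
	strncpy(dst, buf, dst_size - 1);
	dst[dst_size - 1] = '\0';

	len = strlen(dst);
	if (len > 0 && dst[len - 1] == '\n')
		dst[--len] = '\0';

	return len;
}
```

Once the stored name never carries a newline, a plain strcmp() is enough when comparing it against known algorithm names, which is the follow-up simplification the commit mentions.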
    • S
      zram: cosmetic zram_bvec_write() cleanup · 17162f41
      Sergey Senozhatsky committed
      The `bool locked' local variable tells us whether we should call
      zcomp_strm_release() or not (i.e. whether we jumped to the `out' label
      before zcomp_strm_find() was called), which is equivalent to `zstrm'
      being NULL or not.  Remove `locked' and check `zstrm' instead.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17162f41
    • S
      zram: add dynamic device add/remove functionality · 6566d1a3
      Sergey Senozhatsky committed
      We currently don't support on-demand device creation.  The one and only
      way to have N zram devices is to specify the num_devices module
      parameter (default value: 1).  IOW, if at some point the user wants to
      have N + 1 devices, he/she must umount all the existing devices, unload
      the module, and load it again with num_devices set to N + 1.  And do
      this again, if needed.
      
      This patch introduces zram control sysfs class, which has two sysfs
      attrs:
      - hot_add      -- add a new zram device
      - hot_remove   -- remove a specific (device_id) zram device
      
      The hot_add sysfs attr is read-only and supports only automatic device
      id assignment (as requested by Minchan Kim).  A read operation on this
      attr creates a new zram device and returns its device_id or an error
      status.
      
      Usage example:
      	# add a new zram device (the id is assigned automatically)
      	cat /sys/class/zram-control/hot_add
      	2
      
      	# remove a specific zram device
      	echo 4 > /sys/class/zram-control/hot_remove
      
      The zram_add() error code is returned to the user (-ENOMEM in this case):
      
      	cat /sys/class/zram-control/hot_add
      	cat: /sys/class/zram-control/hot_add: Cannot allocate memory
      
      NOTE: there might be users who already depend on the fact that at least
      the zram0 device always gets created by zram_init().  Preserve this
      behavior.
      
      [minchan@kernel.org: use zram->claim to avoid lockdep splat]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6566d1a3
    • S
      zram: close race by open overriding · f405c445
      Sergey Senozhatsky committed
      [ Original patch from Minchan Kim <minchan@kernel.org> ]
      
      Commit ba6b17d6 ("zram: fix umount-reset_store-mount race
      condition") introduced bdev->bd_mutex to protect against a race between
      mount and reset.  At that time, we didn't have the dynamic
      zram-add/remove feature, so it was okay.
      
      However, as we introduce the dynamic device feature, bd_mutex becomes
      a problem.
      
      	CPU 0
      
      echo 1 > /sys/block/zram<id>/reset
        -> kernfs->s_active(A)
          -> zram:reset_store->bd_mutex(B)
      
      	CPU 1
      
      echo <id> > /sys/class/zram/zram-remove
        ->zram:zram_remove: bd_mutex(B)
        -> sysfs_remove_group
          -> kernfs->s_active(A)
      
      IOW, AB -> BA deadlock
      
      The reason we hold bd_mutex in zram_remove is to prevent any incoming
      open of /dev/zram[0-9].  Otherwise, we could remove a zram device that
      others have already opened.  But it causes the above deadlock.
      
      To fix the problem, this patch overrides block_device.open so that it
      returns -EBUSY if zram has claimed the device for reset.  Any incoming
      open will then fail, so we don't need to hold bd_mutex in zram_remove
      any more.
      
      This patch prepares for the zram-add/remove feature.
      
      [sergey.senozhatsky@gmail.com: simplify reset_store()]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f405c445
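
The claim-then-refuse-open idea from this commit can be modelled in user space as below. This is only a sketch under stated assumptions: struct dev_model and the dev_*() helpers are hypothetical, and a pthread mutex stands in for whatever lock the driver uses to serialise open against reset. The point is that the remover first claims the device and open then fails with -EBUSY, so the remover does not have to keep a lock held across the sysfs teardown.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device state; not the driver's struct. */
struct dev_model {
	pthread_mutex_t lock;	/* protects claim and openers */
	bool claim;		/* set while reset/remove owns the device */
	int openers;		/* number of current opens */
};

/* Open succeeds only while nobody has claimed the device. */
static int dev_open(struct dev_model *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->lock);
	if (d->claim)
		ret = -EBUSY;
	else
		d->openers++;
	pthread_mutex_unlock(&d->lock);
	return ret;
}

static void dev_release(struct dev_model *d)
{
	pthread_mutex_lock(&d->lock);
	d->openers--;
	pthread_mutex_unlock(&d->lock);
}

/* Claim for reset/remove; fails if the device is open or already claimed. */
static int dev_claim(struct dev_model *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->lock);
	if (d->openers > 0 || d->claim)
		ret = -EBUSY;
	else
		d->claim = true;
	pthread_mutex_unlock(&d->lock);
	return ret;
}

/* Drop the claim once reset is done, so opens are accepted again. */
static void dev_unclaim(struct dev_model *d)
{
	pthread_mutex_lock(&d->lock);
	d->claim = false;
	pthread_mutex_unlock(&d->lock);
}
```

In this model the lock is held only for the short flag check, never while waiting on another subsystem, which is how the AB -> BA ordering from the diagram above is avoided.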
    • S
      zram: return zram device_id from zram_add() · 92ff1528
      Sergey Senozhatsky committed
      This patch prepares zram to enable on-demand device creation.
      zram_add() performs automatic device_id assignment and returns
      new device id (>= 0) or error code (< 0).
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92ff1528
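
The return convention described in this commit (new device id >= 0, error code < 0) can be sketched as follows. The fixed-size slot table, MAX_DEVICES, and the names device_add() and device_remove() are assumptions for illustration only; the driver uses its own id allocator.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_DEVICES 8	/* hypothetical limit for this sketch */

static bool slot_used[MAX_DEVICES];

/* Assign the lowest free id; return it (>= 0) or a negative errno. */
static int device_add(void)
{
	for (int id = 0; id < MAX_DEVICES; id++) {
		if (!slot_used[id]) {
			slot_used[id] = true;
			return id;
		}
	}
	return -ENOSPC;
}

/* Free a previously assigned id; return 0 or -ENODEV if it is not in use. */
static int device_remove(int id)
{
	if (id < 0 || id >= MAX_DEVICES || !slot_used[id])
		return -ENODEV;

	slot_used[id] = false;
	return 0;
}
```

A caller only needs to test the sign of the result: a non-negative value is the id of the device that was just created, and a negative value is the error to report back.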
    • S
      zram: trivial: correct flag operations comment · b31177f2
      Sergey Senozhatsky committed
      We don't have meta->tb_lock anymore and use the meta table entry
      bit_spin_lock instead.  Update the corresponding comment.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b31177f2
    • S
      zram: report every added and removed device · d12b63c9
      Sergey Senozhatsky committed
      With dynamic device creation/removal (which will be introduced later in
      the series), printing num_devices in zram_init() will not make a lot of
      sense, and neither will printing the number of destroyed devices in
      destroy_devices().  Print a per-device action (added/removed) in
      zram_add() and zram_remove() instead.
      
      Example:
      
      [ 3645.259652] zram: Added device: zram5
      [ 3646.152074] zram: Added device: zram6
      [ 3650.585012] zram: Removed device: zram5
      [ 3655.845584] zram: Added device: zram8
      [ 3660.975223] zram: Removed device: zram6
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d12b63c9