1. 15 7月, 2015 1 次提交
    • A
      cgroup: allow a cgroup subsystem to reject a fork · 7e47682e
      Aleksa Sarai 提交于
      Add a new cgroup subsystem callback can_fork that conditionally
      states whether or not the fork is accepted or rejected by a cgroup
      policy. In addition, add a cancel_fork callback so that if an error
      occurs later in the forking process, any state modified by can_fork can
      be reverted.
      
      Allow for a private opaque pointer to be passed from cgroup_can_fork to
      cgroup_post_fork, allowing for the fork state to be stored by each
      subsystem separately.
      
      Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG>
      enumerations to be be defined and used. In addition, explicitly add a
      CGROUP_CANFORK_COUNT macro to make arrays easier to define.
      
      This is in preparation for implementing the pids cgroup subsystem.
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      7e47682e
  2. 05 7月, 2015 2 次提交
  3. 04 7月, 2015 3 次提交
  4. 03 7月, 2015 1 次提交
    • J
      irqchip: Move IRQCHIP_DECLARE macro to include/linux/irqchip.h · 91e20b50
      Joel Porquet 提交于
      At the moment the IRQCHIP_DECLARE macro is only declared locally in
      drivers/irqchip/irqchip.h. It prevents from using it directly in arch/*
      directories whenever irqchip drivers only exist there, which happens in a few
      cases (e.g. arc, arm, microblaze and mips).
      
      This patch makes the macro to be globally defined, i.e. in
      include/linux/irqchip.h, and thus usable for arch-specific declarations of
      irqchip drivers. In this way, it is very similar to what clocksource does (ie
      CLOCKSOURCE_OF_DECLARE is defined in include/linux/clocksource.h).
      
      For now, this patch only moves the declaration of the macro
      IRQCHIP_DECLARE to the global header 'include/linux/irqchip.h' and make
      'drivers/irqchip/irqchip.h' include 'include/linux/irqchip.h'. Later, other
      patches will get rid of 'drivers/irqchip/irqchip.h' and modify all the impacted
      irqchip drivers.
      Signed-off-by: NJoel Porquet <joel@porquet.org>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Link: http://lkml.kernel.org/r/1435865565-14114-1-git-send-email-joel@porquet.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      91e20b50
  5. 02 7月, 2015 2 次提交
    • T
      writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8
      Tejun Heo 提交于
      52ebea74 ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      
      To fix the bug, bdi destruction will be updated to not wait for cong's
      to drain, which naturally means that cong's may outlive the associated
      bdi.  This is fine for non-root cong's but is problematic for the root
      cong's which are embedded in their bdi's as they may end up getting
      dereferenced after the containing bdi's are freed.
      
      This patch makes root cong's behave the same as non-root cong's.  They
      are no longer embedded in their bdi's but allocated separately during
      bdi initialization, indexed and reference counted the same way.
      
      * As cong handling is the same for all wb's, wb->congested
        initialization is moved into wb_init().
      
      * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
        bdi->wb_congested is now a pointer pointing to the root cong
        allocated during bdi init and minimal refcnting operations are
        implemented.
      
      * The above makes root wb init paths diverge depending on
        CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().
      
      This patch in itself shouldn't cause any consequential behavior
      differences but prepares for the actual fix.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681Tested-by: NJon Christopherson <jon@jons.org>
      
      Added <linux/slab.h> include to backing-dev.h for kfree() definition.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a13f35e8
    • A
      NTB: Move files in preparation for NTB abstraction · ec110bc7
      Allen Hubbe 提交于
      This patch only moves files to their new locations, before applying the
      next two patches adding the NTB Abstraction layer.  Splitting this patch
      from the next is intended make distinct which code is changed only due
      to moving the files, versus which are substantial code changes in adding
      the NTB Abstraction layer.
      Signed-off-by: NAllen Hubbe <Allen.Hubbe@emc.com>
      Signed-off-by: NJon Mason <jdmason@kudzu.us>
      ec110bc7
  6. 01 7月, 2015 16 次提交
  7. 30 6月, 2015 1 次提交
  8. 29 6月, 2015 1 次提交
  9. 28 6月, 2015 1 次提交
  10. 27 6月, 2015 3 次提交
  11. 26 6月, 2015 9 次提交
    • R
      arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952
      Ross Zwisler 提交于
      Based on an original patch by Ross Zwisler [1].
      
      Writes to persistent memory have the potential to be posted to cpu
      cache, cpu write buffers, and platform write buffers (memory controller)
      before being committed to persistent media.  Provide apis,
      memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
      pmem and assert that it is durable in PMEM (a persistent linear address
      range).  A '__pmem' attribute is added so sparse can track proper usage
      of pointers to pmem.
      
      This continues the status quo of pmem being x86 only for 4.2, but
      reworks to ioremap, and wider implementation of memremap() will enable
      other archs in 4.3.
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      [djbw: various reworks]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      61031952
    • T
      libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3
      Toshi Kani 提交于
      Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
      under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.
      
      An example of numa_node values on a 2-socket system with a single
      NVDIMM range on each socket is shown below.
        /sys/bus/nd/devices
        |-- btt0.0/numa_node:0
        |-- btt1.0/numa_node:1
        |-- btt1.1/numa_node:1
        |-- namespace0.0/numa_node:0
        |-- namespace1.0/numa_node:1
        |-- region0/numa_node:0
        |-- region1/numa_node:1
      
      These numa_node files are then linked under the block class of
      their device names.
        /sys/class/block/pmem0/device/numa_node:0
        /sys/class/block/pmem1s/device/numa_node:1
      
      This enables numactl(8) to accept 'block:' and 'file:' paths of
      pmem and btt devices as shown in the examples below.
        numactl --preferred block:pmem0 --show
        numactl --preferred file:/dev/pmem1s --show
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      74ae66c3
    • T
      libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani 提交于
      ACPI NFIT table has System Physical Address Range Structure entries that
      describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
      set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its node ID,
      and set it to a new numa_node field of nd_region_desc, which is then
      conveyed to the nd_region device.
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      41d7a6d6
    • T
      acpi: Add acpi_map_pxm_to_online_node() · 99759869
      Toshi Kani 提交于
      The kernel initializes CPU & memory's NUMA topology from ACPI
      SRAT table.  Some other ACPI tables, such as NFIT and DMAR, also
      contain proximity IDs for their device's NUMA topology.  This
      information can be used to improve performance of these devices.
      
      This patch introduces acpi_map_pxm_to_online_node(), which is
      similar to acpi_map_pxm_to_node(), but always returns an online
      node.  When the mapped node from a given proximity ID is offline,
      it looks up the node distance table and returns the nearest
      online node.
      
      ACPI device drivers, which are called after the NUMA initialization
      has completed in the kernel, can call this interface to obtain their
      device NUMA topology from ACPI tables.  Such drivers do not have to
      deal with offline nodes.  A node may be offline when a device
      proximity ID is unique, SRAT memory entry does not exist, or NUMA is
      disabled, ex. "numa=off" on x86.
      
      This patch also moves the pxm range check from acpi_get_node() to
      acpi_map_pxm_to_node().
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com&gt;>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      99759869
    • D
      libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820
      Dan Williams 提交于
      Upon detection of an unarmed dimm in a region, arrange for descendant
      BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
      "unarmed" via flags passed by platform firmware (NFIT).
      
      The flags in the NFIT memory device sub-structure indicate the state of
      the data on the nvdimm relative to its energy source or last "flush to
      persistence".  For the most part there is nothing the driver can do but
      advertise the state of these flags in sysfs and emit a message if
      firmware indicates that the contents of the device may be corrupted.
      However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
      the block devices incorporating that nvdimm to be marked read-only.
      This is a safe default as the data is still available and new writes are
      held off until the administrator either forces read-write mode, or the
      energy source becomes armed.
      
      A 'read_only' attribute is added to REGION devices to allow for
      overriding the default read-only policy of all descendant block devices.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      58138820
    • R
      libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1
      Ross Zwisler 提交于
      The libnvdimm implementation handles allocating dimm address space (DPA)
      between PMEM and BLK mode interfaces.  After DPA has been allocated from
      a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
      as a struct bio based block device. Unlike PMEM, BLK is required to
      handle platform specific details like mmio register formats and memory
      controller interleave.  For this reason the libnvdimm generic nd_blk
      driver calls back into the bus provider to carry out the I/O.
      
      This initial implementation handles the BLK interface defined by the
      ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
      DCR (dimm control region), BDW (block data window), IDT (interleave
      descriptor) NFIT structures and the hardware register format.
      [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
      [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      047fc8a1
    • V
      nd_btt: atomic sector updates · 5212e11f
      Vishal Verma 提交于
      BTT stands for Block Translation Table, and is a way to provide power
      fail sector atomicity semantics for block devices that have the ability
      to perform byte granularity IO. It relies on the capability of libnvdimm
      namespace devices to do byte aligned IO.
      
      The BTT works as a stacked blocked device, and reserves a chunk of space
      from the backing device for its accounting metadata. It is a bio-based
      driver because all IO is done synchronously, and there is no queuing or
      asynchronous completions at either the device or the driver level.
      
      The BTT uses 'lanes' to index into various 'on-disk' data structures,
      and lanes also act as a synchronization mechanism in case there are more
      CPUs than available lanes. We did a comparison between two lane lock
      strategies - first where we kept an atomic counter around that tracked
      which was the last lane that was used, and 'our' lane was determined by
      atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
      theoretically, no CPU would be blocked waiting for a lane. The other
      strategy was to use the cpu number we're scheduled on to and hash it to
      a lane number. Theoretically, this could block an IO that could've
      otherwise run using a different, free lane. But some fio workloads
      showed that the direct cpu -> lane hash performed faster than tracking
      'last lane' - my reasoning is the cache thrash caused by moving the
      atomic variable made that approach slower than simply waiting out the
      in-progress IO. This supports the conclusion that the driver can be a
      very simple bio-based one that does synchronous IOs instead of queuing.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      [jmoyer: fix nmi watchdog timeout in btt_map_init]
      [jmoyer: move btt initialization to module load path]
      [jmoyer: fix memory leak in the btt initialization path]
      [jmoyer: Don't overwrite corrupted arenas]
      Signed-off-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      5212e11f
    • M
      Revert "block, dm: don't copy bios for request clones" · 78d8e58a
      Mike Snitzer 提交于
      This reverts commit 5f1b670d.
      
      Justification for revert as reported in this dm-devel post:
      https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html
      
      this change should not be pushed to mainline yet.
      
      Firstly, Christoph has a newer version of the patch that fixes silent
      data corruption problem:
        https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html
      
      And the new version still depends on LLDDs to always complete requests
      to the end when error happens, while block API doesn't enforce such a
      requirement. If the assumption is ever broken, the inconsistency between
      request and bio (e.g. rq->__sector and rq->bio) will cause silent data
      corruption:
        https://www.redhat.com/archives/dm-devel/2015-June/msg00022.htmlReported-by: NJunichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      78d8e58a
    • R
      lib/string.c: introduce strreplace() · 94df2904
      Rasmus Villemoes 提交于
      Strings are sometimes sanitized by replacing a certain character (often
      '/') by another (often '!').  In a few places, this is done the same way
      Schlemiel the Painter would do it.  Others are slightly smarter but still
      do multiple strchr() calls.  Introduce strreplace() to do this using a
      single function call and a single pass over the string.
      
      One would expect the return value to be one of three things: void, s, or
      the number of replacements made.  I chose the fourth, returning a pointer
      to the end of the string.  This is more likely to be useful (for example
      allowing the caller to avoid a strlen call).
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94df2904