1. 28 Apr, 2014 16 commits
    • L
      powerpc/pseries: Protect remove_memory() with device hotplug lock · 42dbfc86
      Li Zhong committed
      While testing memory hot-remove, I found the following deadlock:
      
      Process #1141 is drmgr, trying to remove some memory, i.e. memory499.
      It holds the memory_hotplug_mutex, and blocks when trying to remove file
      "online" under dir memory499, in kernfs_drain(), at
              wait_event(root->deactivate_waitq,
                         atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
      
      Process #1120 is trying to online memory499 by
         echo 1 > memory499/online
      
      In .kernfs_fop_write, it uses kernfs_get_active() to increase
      &kn->active, thus blocking process #1141. It is itself blocked later
      when trying to acquire memory_hotplug_mutex, which is held by process
      #1141.
      
      The backtraces of both processes are shown below:
      
      [<c000000001b18600>] 0xc000000001b18600
      [<c000000000015044>] .__switch_to+0x144/0x200
      [<c000000000263ca4>] .online_pages+0x74/0x7b0
      [<c00000000055b40c>] .memory_subsys_online+0x9c/0x150
      [<c00000000053cbe8>] .device_online+0xb8/0x120
      [<c00000000053cd04>] .online_store+0xb4/0xc0
      [<c000000000538ce4>] .dev_attr_store+0x64/0xa0
      [<c00000000030f4ec>] .sysfs_kf_write+0x7c/0xb0
      [<c00000000030e574>] .kernfs_fop_write+0x154/0x1e0
      [<c000000000268450>] .vfs_write+0xe0/0x260
      [<c000000000269144>] .SyS_write+0x64/0x110
      [<c000000000009ffc>] syscall_exit+0x0/0x7c
      
      [<c000000001b18600>] 0xc000000001b18600
      [<c000000000015044>] .__switch_to+0x144/0x200
      [<c00000000030be14>] .__kernfs_remove+0x204/0x300
      [<c00000000030d428>] .kernfs_remove_by_name_ns+0x68/0xf0
      [<c00000000030fb38>] .sysfs_remove_file_ns+0x38/0x60
      [<c000000000539354>] .device_remove_attrs+0x54/0xc0
      [<c000000000539fd8>] .device_del+0x158/0x250
      [<c00000000053a104>] .device_unregister+0x34/0xa0
      [<c00000000055bc14>] .unregister_memory_section+0x164/0x170
      [<c00000000024ee18>] .__remove_pages+0x108/0x4c0
      [<c00000000004b590>] .arch_remove_memory+0x60/0xc0
      [<c00000000026446c>] .remove_memory+0x8c/0xe0
      [<c00000000007f9f4>] .pseries_remove_memblock+0xd4/0x160
      [<c00000000007fcfc>] .pseries_memory_notifier+0x27c/0x290
      [<c0000000008ae6cc>] .notifier_call_chain+0x8c/0x100
      [<c0000000000d858c>] .__blocking_notifier_call_chain+0x6c/0xe0
      [<c00000000071ddec>] .of_property_notify+0x7c/0xc0
      [<c00000000071ed3c>] .of_update_property+0x3c/0x1b0
      [<c0000000000756cc>] .ofdt_write+0x3dc/0x740
      [<c0000000002f60fc>] .proc_reg_write+0xac/0x110
      [<c000000000268450>] .vfs_write+0xe0/0x260
      [<c000000000269144>] .SyS_write+0x64/0x110
      [<c000000000009ffc>] syscall_exit+0x0/0x7c
      
      This patch uses lock_device_hotplug() to protect the remove_memory() call
      in pseries_remove_memblock(), as required by the comment above
      remove_memory():
      
       * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
       * and online/offline operations before this call, as required by
       * try_offline_node().
       */
      void __ref remove_memory(int nid, u64 start, u64 size)
      
      With this lock held, the other process (#1120 above) trying to online the
      memory block will retry the system call when calling
      lock_device_hotplug_sysfs(), and finally get a "No such device" error.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      42dbfc86
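      The retry behaviour described in commit 42dbfc86 can be modelled in user
      space. The sketch below is not kernel code: hotplug_lock, model_lock(),
      model_lock_sysfs() and MODEL_RESTART are illustrative stand-ins for the
      device hotplug lock, lock_device_hotplug(), lock_device_hotplug_sysfs()
      and the system call restart. It only shows why a writer that try-locks
      and backs out cannot deadlock against a remover that holds the lock.

      ```c
      #include <assert.h>
      #include <stdatomic.h>

      /* Hypothetical return value standing in for "restart the system call". */
      #define MODEL_RESTART 1

      /* Stand-in for the device hotplug lock; clear means unlocked. */
      static atomic_flag hotplug_lock = ATOMIC_FLAG_INIT;

      /* Remover path: takes the lock unconditionally (spins in this model). */
      static void model_lock(void)
      {
          while (atomic_flag_test_and_set(&hotplug_lock))
              ; /* busy-wait; the kernel would sleep on a mutex instead */
      }

      static void model_unlock(void)
      {
          atomic_flag_clear(&hotplug_lock);
      }

      /*
       * sysfs writer path: never blocks on the lock. If the lock is busy the
       * caller backs out and retries later.
       */
      static int model_lock_sysfs(void)
      {
          if (!atomic_flag_test_and_set(&hotplug_lock))
              return 0;            /* lock acquired */
          return MODEL_RESTART;    /* lock busy: caller must retry */
      }

      /* Walk through the scenario from the commit message. */
      static void model_demo(void)
      {
          model_lock();                                /* remover (#1141) holds the lock */
          assert(model_lock_sysfs() == MODEL_RESTART); /* writer (#1120) backs out */
          model_unlock();                              /* removal finished */
          assert(model_lock_sysfs() == 0);             /* a later attempt gets the lock */
          model_unlock();
      }
      ```

      In the real fix the retried write finds that memory499 is already gone,
      which is where the "No such device" error comes from.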
    • A
      powerpc/powernv: Create OPAL sglist helper functions and fix endian issues · 3441f04b
      Anton Blanchard committed
      We have two copies of code that creates an OPAL sg list. Consolidate
      these into a common set of helpers and fix the endian issues.
      
      The flash interface embedded a version number in the num_entries
      field, whereas the dump interface did not. Since versioning
      wasn't added to the flash interface and it is impossible to add
      this in a backwards compatible way, just remove it.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      3441f04b
    • A
      powerpc/powernv: Fix little endian issues in OPAL error log code · 14ad0c58
      Anton Blanchard committed
      Fix little endian issues with the OPAL error log code.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      14ad0c58
    • A
      powerpc/powernv: Fix little endian issues with opal_do_notifier calls · 56b4c993
      Anton Blanchard committed
      The bitmap in opal_poll_events and opal_handle_interrupt is
      big endian, so we need to byteswap it on little endian builds.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      56b4c993
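      What "byteswap on little endian builds" means for commit 56b4c993 can be
      shown with a small user-space sketch. load_be64() and EVENT_BIT_3 are
      hypothetical names used only here; kernel code would normally use its
      own endian conversion helpers rather than a hand-written loop.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * Interpret a 64-bit value that firmware stored in big-endian byte order.
       * Works on both little- and big-endian hosts because it only looks at
       * the bytes in memory order.
       */
      static uint64_t load_be64(const void *p)
      {
          const unsigned char *b = p;
          uint64_t v = 0;

          for (int i = 0; i < 8; i++)
              v = (v << 8) | b[i];
          return v;
      }

      /* Hypothetical event bit, used only for this illustration. */
      #define EVENT_BIT_3 (UINT64_C(1) << 3)

      static void load_be64_demo(void)
      {
          /* Big-endian encoding of 0x8: the low-order byte comes last. */
          const unsigned char raw[8] = { 0, 0, 0, 0, 0, 0, 0, 0x08 };

          assert(load_be64(raw) & EVENT_BIT_3);
      }
      ```

      Testing the bit on the raw value without conversion would give the
      wrong answer on a little-endian host, since the 0x08 byte would land in
      the most significant position.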
    • A
      powerpc/powernv: Use uint64_t instead of size_t in OPAL APIs · 2bad7423
      Anton Blanchard committed
      Using size_t in our APIs is asking for trouble, especially
      when some OPAL calls use size_t pointers.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      2bad7423
    • W
      powerpc/powernv: Release the refcount for pci_dev · 4966bfa1
      Wei Yang committed
      On the PowerNV platform, we are holding an unnecessary refcount on a pci_dev,
      which prevents the pci_dev from being destroyed when hotplugging a pci device.
      
      This patch releases the unnecessary refcount.
      Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      4966bfa1
    • W
      powerpc/powernv: Reduce multi-hit of iommu_add_device() · 3f28c5af
      Wei Yang committed
      During the EEH hotplug event, iommu_add_device() is invoked three times,
      and two of those calls trigger a warning or an error.
      
      The three invocations of iommu_add_device() are:
      
          pci_device_add
             ...
             set_iommu_table_base_and_group   <- 1st time, fail
          device_add
             ...
             tce_iommu_bus_notifier           <- 2nd time, succeeds
          pcibios_add_pci_devices
             ...
             pcibios_setup_bus_devices        <- 3rd time, re-attach
      
      The first time fails, since the dev->kobj->sd is not initialized. The
      dev->kobj->sd is initialized in device_add().
      The third time's warning is triggered by the re-attach of the iommu_group.
      
      After applying this patch, the error
      
          iommu_tce: 0003:05:00.0 has not been added, ret=-14
      
      and the warning
      
          [  204.123609] ------------[ cut here ]------------
          [  204.123645] WARNING: at arch/powerpc/kernel/iommu.c:1125
          [  204.123680] Modules linked in: xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT bnep bluetooth 6lowpan_iphc rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc mlx4_ib ib_sa ib_mad ib_core ib_addr ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bnx2x tg3 mlx4_core nfsd ptp mdio ses libcrc32c nfs_acl enclosure be2net pps_core shpchp lockd kvm uinput sunrpc binfmt_misc lpfc scsi_transport_fc ipr scsi_tgt
          [  204.124356] CPU: 18 PID: 650 Comm: eehd Not tainted 3.14.0-rc5yw+ #102
          [  204.124400] task: c0000027ed485670 ti: c0000027ed50c000 task.ti: c0000027ed50c000
          [  204.124453] NIP: c00000000003cf80 LR: c00000000006c648 CTR: c00000000006c5c0
          [  204.124506] REGS: c0000027ed50f440 TRAP: 0700   Not tainted  (3.14.0-rc5yw+)
          [  204.124558] MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 88008084  XER: 20000000
          [  204.124682] CFAR: c00000000006c644 SOFTE: 1
          GPR00: c00000000006c648 c0000027ed50f6c0 c000000001398380 c0000027ec260300
          GPR04: c0000027ea92c000 c00000000006ad00 c0000000016e41b0 0000000000000110
          GPR08: c0000000012cd4c0 0000000000000001 c0000027ec2602ff 0000000000000062
          GPR12: 0000000028008084 c00000000fdca200 c0000000000d1d90 c0000027ec281a80
          GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
          GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
          GPR24: 000000005342697b 0000000000002906 c000001fe6ac9800 c000001fe6ac9800
          GPR28: 0000000000000000 c0000000016e3a80 c0000027ea92c090 c0000027ea92c000
          [  204.125353] NIP [c00000000003cf80] .iommu_add_device+0x30/0x1f0
          [  204.125399] LR [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
          [  204.125443] Call Trace:
          [  204.125464] [c0000027ed50f6c0] [c0000027ed50f750] 0xc0000027ed50f750 (unreliable)
          [  204.125526] [c0000027ed50f750] [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
          [  204.125588] [c0000027ed50f7d0] [c000000000069cc8] .pnv_pci_dma_dev_setup+0x78/0x340
          [  204.125650] [c0000027ed50f870] [c000000000044408] .pcibios_setup_device+0x88/0x2f0
          [  204.125712] [c0000027ed50f940] [c000000000046040] .pcibios_setup_bus_devices+0x60/0xd0
          [  204.125774] [c0000027ed50f9c0] [c000000000043acc] .pcibios_add_pci_devices+0xdc/0x1c0
          [  204.125837] [c0000027ed50fa50] [c00000000086f970] .eeh_reset_device+0x36c/0x4f0
          [  204.125939] [c0000027ed50fb20] [c00000000003a2d8] .eeh_handle_normal_event+0x448/0x480
          [  204.126068] [c0000027ed50fbc0] [c00000000003a35c] .eeh_handle_event+0x4c/0x340
          [  204.126192] [c0000027ed50fc80] [c00000000003a74c] .eeh_event_handler+0xfc/0x1b0
          [  204.126319] [c0000027ed50fd30] [c0000000000d1ea0] .kthread+0x110/0x130
          [  204.126430] [c0000027ed50fe30] [c00000000000a460] .ret_from_kernel_thread+0x5c/0x7c
          [  204.126556] Instruction dump:
          [  204.126610] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71 7c7e1b78 60000000
          [  204.126787] 60000000 e87e0298 3143ffff 7d2a1910 <0b090000> 2fa90000 40de00c8 ebfe0218
          [  204.126966] ---[ end trace 6e7aefd80add2973 ]---
      
      are cleared.
      
      This patch removes iommu_add_device() in pnv_pci_ioda_dma_dev_setup(), which
      reverts part of the change in commit d905c5df (PPC: POWERNV: move
      iommu_add_device earlier).
      Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      3f28c5af
    • A
      powerpc/powernv: Fix little endian issues in OPAL flash code · cc146d1d
      Anton Blanchard committed
      With this patch I was able to update firmware on an LE kernel.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      cc146d1d
    • B
      powerpc/powernv: Fix kexec races going back to OPAL · 298b34d7
      Benjamin Herrenschmidt committed
      We have a subtle race when sending CPUs back to OPAL on kexec.
      
      We mark them as "in real mode" right before we send them down. Once
      we've booted the new kernel, it might try to call opal_reinit_cpus()
      to change endianness, and that requires all CPUs to be spinning inside
      OPAL.
      
      However there is no synchronization here and we've observed cases
      where the returning CPUs hadn't established their new state inside
      OPAL before opal_reinit_cpus() is called, causing it to fail.
      
      The proper fix is to actually wait for them to go down all the way
      from the kexec'ing kernel.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      298b34d7
    • J
      powerpc/powernv: Check sysparam size before creation · 63aecfb2
      Joel Stanley committed
      The size of the sysparam sysfs files is determined from the device tree
      at boot. However the buffer is hard coded to 64 bytes. If we encounter a
      parameter that is larger than 64 bytes, or misparse the device tree, the
      buffer will overflow when reading or writing to the parameter.
      
      Check it at discovery time, and if the parameter is too large, do not
      create a sysfs entry for it.
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      63aecfb2
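      The discovery-time check described in commit 63aecfb2 can be sketched as
      follows. This is a user-space illustration under assumed names: struct
      param_desc, param_fits() and count_creatable() are hypothetical, and only
      the 64-byte buffer size comes from the commit message.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Size of the fixed transfer buffer described in the commit message. */
      #define PARAM_BUF_LEN 64

      /* Hypothetical record for one parameter found in the device tree. */
      struct param_desc {
          const char *name;
          uint32_t size;      /* size advertised by the device tree */
      };

      /* A parameter is usable only if its data fits in the fixed buffer. */
      static bool param_fits(const struct param_desc *p)
      {
          return p->size <= PARAM_BUF_LEN;
      }

      /* Count how many entries would be created; oversized ones are skipped. */
      static size_t count_creatable(const struct param_desc *params, size_t n)
      {
          size_t created = 0;

          for (size_t i = 0; i < n; i++) {
              if (!param_fits(&params[i]))
                  continue;   /* a real implementation would log and skip */
              created++;
          }
          return created;
      }

      static void param_fits_demo(void)
      {
          const struct param_desc params[] = {
              { "small", 4 }, { "exact", 64 }, { "too-big", 65 },
          };

          assert(count_creatable(params, 3) == 2);
      }
      ```

      Rejecting the entry up front means the read and write paths never see a
      parameter whose size exceeds the buffer.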
    • J
      16003d23
    • J
      powerpc/powernv: Check sysfs size before copying · 85390378
      Joel Stanley committed
      The sysparam code currently uses the userspace supplied number of
      bytes when memcpy()ing into a local 64-byte buffer.
      
      Limit the maximum number of bytes by the size of the buffer.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      85390378
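      One way to limit the copy described in commit 85390378 is to clamp the
      byte count before calling memcpy(). The sketch below is a user-space
      illustration; copy_clamped() is a hypothetical helper, and the actual
      patch may bound the count differently.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Size of the local buffer described in the commit message. */
      #define PARAM_BUF_LEN 64

      /*
       * Copy at most PARAM_BUF_LEN bytes of caller-supplied data into the
       * fixed buffer and return the number of bytes actually copied.
       */
      static size_t copy_clamped(unsigned char dst[PARAM_BUF_LEN],
                                 const void *src, size_t count)
      {
          if (count > PARAM_BUF_LEN)
              count = PARAM_BUF_LEN;
          memcpy(dst, src, count);
          return count;
      }

      static void copy_clamped_demo(void)
      {
          unsigned char user_data[100];
          unsigned char buf[PARAM_BUF_LEN];

          memset(user_data, 0xAB, sizeof(user_data));
          /* A 100-byte request is cut down to the buffer size. */
          assert(copy_clamped(buf, user_data, sizeof(user_data)) == PARAM_BUF_LEN);
          /* A request that fits is copied in full. */
          assert(copy_clamped(buf, user_data, 10) == 10);
      }
      ```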
    • J
      powerpc/powernv: Use ssize_t for sysparam return values · b8569d23
      Joel Stanley committed
      The OPAL calls are returning int64_t values, which the sysparam code
      stores in an int, and the sysfs callback returns ssize_t. Make the code
      easier to read by consistently using ssize_t.
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      b8569d23
    • J
      powerpc/powernv: Fix sysparam sysfs error handling · ba9a32b1
      Joel Stanley committed
      When a sysparam query in OPAL returned a negative value (error code),
      sysfs would spew out a decent chunk of memory; almost 64K more than
      expected. This was traced to a signed/unsigned mix-up in the OPAL sysparam
      sysfs code at sys_param_show.
      
      The return value of sys_param_show is a ssize_t, calculated using
      
        return ret ? ret : attr->param_size;
      
      Alan Modra explains:
      
        "attr->param_size" is an unsigned int, "ret" an int, so the overall
        expression has type unsigned int.  Result is that ret is cast to
        unsigned int before being cast to ssize_t.
      
      Instead of using the ternary operator, set ret to the param_size if an
      error is not detected. The same bug exists in the sysfs write callback;
      this patch fixes it in the same way.
      
      A note on debugging this next time: on my system gcc will warn about
      this if compiled with -Wsign-compare, which is not enabled by -Wall,
      only -Wextra.
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ba9a32b1
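      The conversion problem from commit ba9a32b1 can be reproduced in a few
      lines of user-space C. show_buggy() and show_fixed() are hypothetical
      names, and int64_t stands in for ssize_t on a 64-bit system; the sketch
      assumes int and unsigned int are narrower than 64 bits.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * Buggy pattern: the two result operands are int and unsigned int, so
       * the conditional expression has type unsigned int. A negative error
       * code is converted to a large positive number before it is widened.
       */
      static int64_t show_buggy(int ret, unsigned int param_size)
      {
          return ret ? ret : param_size;
      }

      /* Fixed pattern: keep the error path and the size path separate. */
      static int64_t show_fixed(int ret, unsigned int param_size)
      {
          int64_t out = ret;

          if (!out)
              out = param_size;
          return out;
      }

      static void ternary_demo(void)
      {
          assert(show_buggy(-1, 4) > 0);      /* error code lost */
          assert(show_fixed(-1, 4) == -1);    /* error code preserved */
          assert(show_buggy(0, 4) == 4);      /* success path is unaffected */
          assert(show_fixed(0, 4) == 4);
      }
      ```

      Because the buggy result is positive, the caller treats it as a byte
      count rather than an error, which matches the oversized sysfs output
      described above.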
    • L
      powerpc: Fix Oops in rtas_stop_self() · 4fb8d027
      Li Zhong committed
      Commit 41dd03a9 may cause an Oops in rtas_stop_self().
      
      The reason is that rtas_args was moved into stack space. For a box
      with more than 4GB RAM, the stack could easily be outside the 32-bit
      range, but RTAS is 32-bit.
      
      So the patch moves rtas_args away from the stack by adding static before
      it.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: stable@vger.kernel.org # 3.14+
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      4fb8d027
  2. 09 Apr, 2014 7 commits
  3. 08 Apr, 2014 1 commit
  4. 07 Apr, 2014 4 commits
  5. 24 Mar, 2014 4 commits
  6. 20 Mar, 2014 5 commits
  7. 12 Mar, 2014 1 commit
    • G
      of: Make device nodes kobjects so they show up in sysfs · 75b57ecf
      Grant Likely committed
      Device tree nodes are already treated as objects, and we already want to
      expose them to userspace which is done using the /proc filesystem today.
      Right now the kernel has to do a lot of work to keep the /proc view in
      sync with the in-kernel representation. If device_nodes are switched to
      be kobjects then the device tree code can be a whole lot simpler. It
      also turns out that switching to using /sysfs from /proc results in
      smaller code and data size, and the userspace ABI won't change if
      /proc/device-tree symlinks to /sys/firmware/devicetree/base.
      
      v7: Add missing sysfs_bin_attr_init()
      v6: Add __of_add_property() early init fixes from Pantelis
      v5: Rename firmware/ofw to firmware/devicetree
          Fix updating property values in sysfs
      v4: Fixed build error on Powerpc
          Fixed handling of dynamic nodes on powerpc
      v3: Fixed handling of duplicate attribute and child node names
      v2: switch to using sysfs bin_attributes which solve the problem of
          reporting incorrect property size.
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
      Tested-by: Sascha Hauer <s.hauer@pengutronix.de>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Pantelis Antoniou <panto@antoniou-consulting.com>
      75b57ecf
  8. 11 Mar, 2014 1 commit
    • J
      mm: fix GFP_THISNODE callers and clarify · e97ca8e5
      Johannes Weiner committed
      GFP_THISNODE is for callers that implement their own clever fallback to
      remote nodes.  It restricts the allocation to the specified node and
      does not invoke reclaim, assuming that the caller will take care of it
      when the fallback fails, e.g.  through a subsequent allocation request
      without GFP_THISNODE set.
      
      However, many current GFP_THISNODE users only want the node exclusive
      aspect of the flag, without actually implementing their own fallback or
      triggering reclaim if necessary.  This results in things like page
      migration failing prematurely even when there is easily reclaimable
      memory available, unless kswapd happens to be running already or a
      concurrent allocation attempt triggers the necessary reclaim.
      
      Convert all callsites that don't implement their own fallback strategy
      to __GFP_THISNODE.  This restricts the allocation to a single node too, but
      at the same time allows the allocator to enter the slowpath, wake
      kswapd, and invoke direct reclaim if necessary, to make the allocation
      happen when memory is full.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e97ca8e5
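      The distinction drawn in commit e97ca8e5 between GFP_THISNODE and
      __GFP_THISNODE can be modelled with plain bit flags. The sketch below
      assumes that, on NUMA kernels of that era, GFP_THISNODE was a composite
      of __GFP_THISNODE, __GFP_NORETRY and __GFP_NOWARN, and that the
      allocator skipped reclaim only when the full composite was set. All
      M_-prefixed names and bit values are illustrative and are not the
      kernel's definitions.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Illustrative bit values; these are NOT the kernel's actual values. */
      #define M_GFP_THISNODE  0x1u   /* models __GFP_THISNODE: stay on this node */
      #define M_GFP_NORETRY   0x2u   /* models __GFP_NORETRY */
      #define M_GFP_NOWARN    0x4u   /* models __GFP_NOWARN */

      /* Models the composite GFP_THISNODE (assumed definition, see lead-in). */
      #define M_GFP_THISNODE_COMPOSITE \
          (M_GFP_THISNODE | M_GFP_NORETRY | M_GFP_NOWARN)

      /* Reclaim is skipped only when the full composite is present. */
      static bool model_may_reclaim(unsigned int flags)
      {
          return (flags & M_GFP_THISNODE_COMPOSITE) != M_GFP_THISNODE_COMPOSITE;
      }

      /* Node restriction depends on the single bit alone. */
      static bool model_node_exclusive(unsigned int flags)
      {
          return (flags & M_GFP_THISNODE) != 0;
      }

      static void gfp_model_demo(void)
      {
          /* Composite flag: node-exclusive, but reclaim is skipped. */
          assert(model_node_exclusive(M_GFP_THISNODE_COMPOSITE));
          assert(!model_may_reclaim(M_GFP_THISNODE_COMPOSITE));

          /* Single bit: still node-exclusive, and reclaim is allowed. */
          assert(model_node_exclusive(M_GFP_THISNODE));
          assert(model_may_reclaim(M_GFP_THISNODE));
      }
      ```

      Callers that only want the node-exclusive behaviour therefore get it
      from the single bit without giving up reclaim.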
  9. 07 Mar, 2014 1 commit
    • S
      powerpc/powernv Platform dump interface · c7e64b9c
      Stewart Smith committed
      This enables support for userspace to fetch and initiate FSP and
      Platform dumps from the service processor (via firmware) through sysfs.
      
      Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
      
      Flow:
        - We register for OPAL notification events.
        - OPAL sends new dump available notification.
        - We make information on dump available via sysfs
        - Userspace requests dump contents
        - We retrieve the dump via OPAL interface
        - User copies the dump data
        - userspace sends ack for dump
        - We send ACK to OPAL.
      
      sysfs files:
        - We add the /sys/firmware/opal/dump directory
        - echoing 1 (well, anything, but in future we may support
          different dump types) to /sys/firmware/opal/dump/initiate_dump
          will initiate a dump.
        - Each dump that we've been notified of gets a directory
          in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex,
          as this is what's used elsewhere to identify the dump).
        - Each dump has files: id, type, dump and acknowledge
          dump is binary and is the dump itself.
          echoing 'ack' to acknowledge (currently any string will do) will
          acknowledge the dump and it will soon after disappear from sysfs.
      
      OPAL APIs:
        - opal_dump_init()
        - opal_dump_info()
        - opal_dump_read()
        - opal_dump_ack()
        - opal_dump_resend_notification()
      
      Currently we are only ever notified for one dump at a time (until
      the user explicitly acks the current dump, then we get a notification
      of the next dump), but this kernel code should "just work" when OPAL
      starts notifying us of all the dumps present.
      Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      c7e64b9c
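      The sysfs files listed in commit c7e64b9c are driven by writing short
      strings to them. A minimal user-space sketch in C, assuming a system
      that exposes /sys/firmware/opal/dump: write_attr() and initiate_dump()
      are hypothetical helpers, not part of any kernel or library API.

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Path taken from the commit message; present only on OPAL systems. */
      #define INITIATE_DUMP_PATH "/sys/firmware/opal/dump/initiate_dump"

      /*
       * Write a short string to a sysfs-style attribute file.
       * Returns 0 on success and -1 on any error.
       */
      static int write_attr(const char *path, const char *value)
      {
          FILE *f;
          int ok;

          assert(path != NULL && value != NULL);

          f = fopen(path, "w");
          if (!f)
              return -1;
          ok = fputs(value, f) >= 0;
          if (fclose(f) != 0)     /* write errors may only show up on close */
              ok = 0;
          return ok ? 0 : -1;
      }

      /* Ask firmware for a new dump; equivalent to `echo 1 > initiate_dump`. */
      static int initiate_dump(void)
      {
          return write_attr(INITIATE_DUMP_PATH, "1\n");
      }
      ```

      Acknowledging a dump works the same way: pass the path of that dump's
      acknowledge file and the string "ack" to write_attr(). The path is left
      to the caller because the exact directory name format is not spelled
      out in the commit message.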