1. 29 July, 2015 4 commits
    • M
      powerpc: Don't negate error in syscall_set_return_value() · 1b1a3702
      Michael Ellerman committed
      Currently the only caller of syscall_set_return_value() is seccomp
      filter, which is not enabled on powerpc.
      
      This means we have not noticed that our implementation of
      syscall_set_return_value() negates error, even though the value passed
      in is already negative.
      
      So remove the negation in syscall_set_return_value(), and expect the
      caller to do it like all other implementations do.
      
      Also add a comment about the ccr handling.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
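The convention described above can be sketched in plain C. This is a minimal userspace sketch, not the kernel code: `struct regs_sketch`, `CR0_SO` and `set_return_value_sketch()` are hypothetical stand-ins, and the bit value used for the condition register flag is an assumption.

```c
#include <errno.h>

/* Hypothetical stand-in for the powerpc register frame; only the fields used here. */
struct regs_sketch {
    unsigned long gpr[32];  /* general purpose registers; r3 carries the result */
    unsigned long ccr;      /* condition register */
};

#define CR0_SO 0x10000000UL /* assumed summary-overflow bit that flags an error */

/* 'error' is expected to be zero or an already negative errno value. */
static void set_return_value_sketch(struct regs_sketch *regs, long error, long val)
{
    if (error) {
        regs->ccr |= CR0_SO;
        regs->gpr[3] = (unsigned long)error;  /* stored as passed, no negation */
    } else {
        regs->ccr &= ~CR0_SO;
        regs->gpr[3] = (unsigned long)val;
    }
}
```

A caller such as a seccomp-style filter would pass `-EPERM` directly; negating it again inside the helper would leave a positive value in r3.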
    • M
      powerpc: Drop unused syscall_get_error() · 2923e6d5
      Michael Ellerman committed
      syscall_get_error() is unused, and never has been.
      
      It's also probably wrong, as it negates r3 before returning it, but that
      depends on what the caller is expecting.
      
      It also doesn't deal with compat, and doesn't deal with TIF_NOERROR.
      
      Although we could fix those, until it has a caller and it's clear what
      semantics the caller wants it's just untested code. So drop it.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
    • M
      powerpc/kernel: Change the do_syscall_trace_enter() API · d3837414
      Michael Ellerman committed
      The API for calling do_syscall_trace_enter() is currently sensible
      enough, it just returns the (modified) syscall number.
      
      However once we enable seccomp filter it will get more complicated. When
      seccomp filter runs, the seccomp kernel code (via SECCOMP_RET_ERRNO), or
      a ptracer (via SECCOMP_RET_TRACE), may reject the syscall and *may* or may
      *not* set a return value in r3.
      
      That means the assembler that calls do_syscall_trace_enter() can not
      blindly return ENOSYS, it needs to only return ENOSYS if a return value
      has not already been set.
      
      There is no way to implement that logic with the current API. So change
      the do_syscall_trace_enter() API to make it deal with the return code
      juggling, and the assembler can then just return whatever return code it
      is given.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
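One way to picture the "return code juggling" is the hedged sketch below. It is not the kernel implementation: `struct trace_regs`, `enum filter_verdict` and `trace_enter_sketch()` are hypothetical names, and the choice of -1 as the "skip" marker is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical register frame: r0 carries the syscall number, r3 the result. */
struct trace_regs {
    long gpr[32];
};

/* Hypothetical outcomes of the filter or tracer stage. */
enum filter_verdict {
    VERDICT_ALLOW,            /* run the syscall */
    VERDICT_SKIP,             /* reject it, no return value was chosen */
    VERDICT_SKIP_WITH_RESULT  /* reject it, a return value was chosen */
};

/*
 * Returns the syscall number to run, or -1 when the syscall must be skipped.
 * In the skip case r3 already holds the value the caller should hand back,
 * so the caller never has to decide between ENOSYS and a preset value.
 */
static long trace_enter_sketch(struct trace_regs *regs,
                               enum filter_verdict verdict, long chosen)
{
    switch (verdict) {
    case VERDICT_SKIP_WITH_RESULT:
        regs->gpr[3] = chosen;   /* keep what the filter or tracer set */
        return -1;
    case VERDICT_SKIP:
        regs->gpr[3] = -ENOSYS;  /* nothing was set, fall back to ENOSYS */
        return -1;
    default:
        return regs->gpr[0];
    }
}
```

With this shape the calling assembler only needs to test for a negative return and then return whatever is in r3.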
    • M
      powerpc/kernel: Switch to using MAX_ERRNO · c3525940
      Michael Ellerman committed
      Currently on powerpc we have our own #define for the highest (negative)
      errno value, called _LAST_ERRNO. This is defined to be 516, for reasons
      which are not clear.
      
      The generic code, and x86, use MAX_ERRNO, which is defined to be 4095.
      
      In particular seccomp uses MAX_ERRNO to restrict the value that a
      seccomp filter can return.
      
      Currently with the mismatch between _LAST_ERRNO and MAX_ERRNO, a seccomp
      tracer wanting to return 600, expecting it to be seen as an error, would
      instead find on powerpc that userspace sees a successful syscall with a
      return value of 600.
      
      To avoid this inconsistency, switch powerpc to use MAX_ERRNO.
      
      We are somewhat confident that generic syscalls that can return a
      non-error value above negative MAX_ERRNO have already been updated to
      use force_successful_syscall_return().
      
      I have also checked all the powerpc specific syscalls, and believe that
      none of them expect to return a non-error value between -MAX_ERRNO and
      -516. So this change should be safe ...
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
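The mismatch can be illustrated with a small range check. This is a sketch under assumptions: `is_error_value()` and the two `_SKETCH` constants are hypothetical names, and the unsigned comparison is only one common way to test whether a raw return value falls in the range [-limit, -1].

```c
#define LAST_ERRNO_SKETCH 516L   /* the old powerpc limit named in the message */
#define MAX_ERRNO_SKETCH  4095L  /* the generic limit named in the message */

/* A raw return value is treated as an error when it lies in [-limit, -1]. */
static int is_error_value(long ret, long limit)
{
    return (unsigned long)ret >= (unsigned long)-limit;
}
```

For the raw value -600, `is_error_value(-600, LAST_ERRNO_SKETCH)` is false while `is_error_value(-600, MAX_ERRNO_SKETCH)` is true, which is the inconsistency the commit removes.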
  2. 27 July, 2015 1 commit
  3. 25 July, 2015 2 commits
  4. 23 July, 2015 3 commits
    • P
      powerpc: Use hardware RNG for arch_get_random_seed_* not arch_get_random_* · 01c9348c
      Paul Mackerras committed
      The hardware RNG on POWER8 and POWER7+ can be relatively slow, since
      it can only supply one 64-bit value per microsecond.  Currently we
      read it in arch_get_random_long(), but that slows down reading from
      /dev/urandom since the code in random.c calls arch_get_random_long()
      for every longword read from /dev/urandom.
      
      Since the hardware RNG supplies high-quality entropy on every read, it
      matches the semantics of arch_get_random_seed_long() better than those
      of arch_get_random_long().  Therefore this commit makes the code use
      the POWER8/7+ hardware RNG only for arch_get_random_seed_{long,int}
      and not for arch_get_random_{long,int}.
      
      This won't affect any other PowerPC-based platforms because none of
      them currently support a hardware RNG.  To make it clear that the
      ppc_md function pointer is used for arch_get_random_seed_*, we rename
      it from get_random_long to get_random_seed.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
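The split between the two families of hooks might look roughly like the sketch below. All names ending in `_sketch`, the `md` variable and `fake_hw_rng()` are hypothetical; only the idea of a platform function pointer named `get_random_seed` comes from the message.

```c
#include <stddef.h>

/* Hypothetical stand-in for the platform hook table. */
struct machdep_sketch {
    int (*get_random_seed)(unsigned long *v);  /* non-NULL only with a hardware RNG */
};

static struct machdep_sketch md;

/* Fast path used for every word of output: no slow hardware read here. */
static int arch_get_random_long_sketch(unsigned long *v)
{
    (void)v;
    return 0;  /* 0 means "no arch-provided value" */
}

/* Seed path: may be slow, so it is the only place the hardware RNG is read. */
static int arch_get_random_seed_long_sketch(unsigned long *v)
{
    if (md.get_random_seed)
        return md.get_random_seed(v);
    return 0;
}

/* Fake hardware RNG used only to exercise the sketch. */
static int fake_hw_rng(unsigned long *v)
{
    *v = 0x1234UL;
    return 1;
}
```

Setting `md.get_random_seed = fake_hw_rng` makes only the seed variant return a value; the per-word variant stays cheap.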
    • T
      powerpc/rtas: Introduce rtas_get_sensor_fast() for IRQ handlers · 1c2cb594
      Thomas Huth committed
      The EPOW interrupt handler uses rtas_get_sensor(), which in turn
      uses rtas_busy_delay() to wait for RTAS becoming ready in case it
      is necessary. But rtas_busy_delay() is annotated with might_sleep()
      and thus may not be used by interrupt handlers like the EPOW handler!
      This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
      enabled:
      
       BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
       in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
       Call Trace:
       [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
       [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
       [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
       [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
       [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
       [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
       [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
       [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
       [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
       [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
       [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
       [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
       [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180
      
      Fix this issue by introducing a new rtas_get_sensor_fast() function
      that does not use rtas_busy_delay() - and thus can only be used for
      sensors that do not cause a BUSY condition - known as "fast" sensors.
      
      The EPOW sensor is defined to be "fast" in sPAPR - mpe.
      
      Fixes: 587f83e8 ("powerpc/pseries: Use rtas_get_sensor in RAS code")
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
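The idea of a non-sleeping sensor read can be sketched as follows. This is not the kernel function: the firmware call is replaced by a caller-supplied function pointer, the `SK_` status values are assumptions modelled on typical RTAS busy codes, and returning `-EIO` on an unexpected busy status is a choice made for this sketch.

```c
#include <errno.h>

/* Assumed status codes; the real definitions live in the kernel's rtas.h. */
#define SK_RTAS_BUSY               (-2)
#define SK_RTAS_EXTENDED_DELAY_MIN 9900
#define SK_RTAS_EXTENDED_DELAY_MAX 9905

/* Hypothetical stand-in for the firmware call that reads a sensor. */
typedef int (*sensor_call_fn)(int sensor, int index, int *state);

static int is_busy_status(int rc)
{
    return rc == SK_RTAS_BUSY ||
           (rc >= SK_RTAS_EXTENDED_DELAY_MIN && rc <= SK_RTAS_EXTENDED_DELAY_MAX);
}

/* Single attempt, no retry loop and no sleeping, so it is usable from IRQ context. */
static int get_sensor_fast_sketch(sensor_call_fn call, int sensor, int index, int *state)
{
    int rc = call(sensor, index, state);

    if (is_busy_status(rc))
        return -EIO;  /* a "fast" sensor is not expected to report busy */
    return rc;
}

/* Fake firmware calls used only to exercise the sketch. */
static int fake_ok(int sensor, int index, int *state)
{
    (void)sensor; (void)index;
    *state = 3;
    return 0;
}

static int fake_busy(int sensor, int index, int *state)
{
    (void)sensor; (void)index; (void)state;
    return SK_RTAS_BUSY;
}
```

The sleeping variant would instead loop and wait while the status is busy, which is exactly what an interrupt handler must not do.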
    • T
      powerpc/rtas: Replace magic values with defines · 9ef03193
      Thomas Huth committed
      rtas.h already has some nice #defines for RTAS return status
      codes - let's use them instead of hard-coded "magic" values!
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 21 July, 2015 3 commits
  6. 16 July, 2015 5 commits
  7. 13 July, 2015 20 commits
    • G
      powerpc/powernv: Unfreeze VF PE on releasing it · f951e510
      Gavin Shan committed
      When the PE of an SR-IOV VF is released, it is wrongly forced into
      the frozen state, so when the same PE is later picked for another
      VF it doesn't work. The patch fixes the issue by unfreezing, rather
      than freezing, the VF PE when releasing it.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • G
      powerpc/powernv: Include VF PE in PELTV of PF PE · 283e2d8a
      Gavin Shan committed
      The PELTV of PF PE should include VF PE, which is missed by current
      code, so that the VF PE is frozen automatically when freezing PF PE.
      The patch fixes the PELTV of PF PE to include VF PE.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • G
      powerpc/powernv: Pick M64 PEs based on BARs · 26ba248d
      Gavin Shan committed
      On PHB3, PE might be reserved in advance to reflect the M64 segments
      consumed by the PE according to M64 BARs (exclude VF BARs) of the PCI
      devices included in the PE. The PE is picked based on M64 BARs instead
      of the bridge's M64 windows, which might include VF BARs. Otherwise,
      wrong PE could be picked.
      
      The patch calculates the used M64 segments and PE numbers according to
      the M64 BARs, excluding VF BARs, of PCI devices in one particular PE,
      instead of the bridge's M64 windows. Then the right PE number is picked.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • G
      powerpc/powernv: Boolean argument for pnv_ioda_setup_bus_PE() · d1203852
      Gavin Shan committed
      The patch changes the type of last argument of pnv_ioda_setup_bus_PE()
      and phb::pick_m64_pe() to boolean. No functional change.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • G
      powerpc/powernv: Reserve M64 PEs based on BARs · 96a2f92b
      Gavin Shan committed
      On PHB3, some PEs might be reserved in advance to reflect the M64
      segments consumed by those PEs. We're reserving PEs based on the
      M64 window of root port, which might contain VF BAR. The PEs for
      VFs are allocated dynamically, not reserved based on the consumed
      M64 segments. So the M64 window of root port isn't reliable for
      the task. Instead, we go through M64 BARs (VF BARs excluded) of
      PCI devices under the specified root bus and reserve PEs accordingly,
      as the patch does.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • G
      powerpc/powernv: Allow to reserve one PE for multiple times · e9dc4d7f
      Gavin Shan committed
      The PE numbers are reserved according to the root port's M64 window,
      which is aligned to the M64 segment size. So no PE should be
      reserved more than once. Subsequent patches will reserve PE numbers
      according to the M64 BARs of PCI devices, which aren't aligned to
      the M64 segment size. It means one particular PE could be reserved
      multiple times.
      
      The patch allows one PE to be reserved multiple times and prints
      a warning message at debugging level when that happens.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • A
      powerpc: Remove mtmsrd(), use existing mtmsr() · 1c539731
      Anton Blanchard committed
      mtmsr() does the right thing on 32bit and 64bit, so use it everywhere.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc: Add macros for the ibm_architecture_vec[] lengths · e8a4fd0a
      Michael Ellerman committed
      The encoding of the lengths in the ibm_architecture_vec array is
      "interesting" to say the least. It's non-obvious how the number of bytes
      we provide relates to the length value.
      
      In fact we already got it wrong once, see 11e9ed43 "Fix up
      ibm_architecture_vec definition".
      
      So add some macros to make it (hopefully) clearer. These at least have
      the property that the integer present in the code is equal to the number
      of bytes that follows it.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
    • B
      powerpc/iommu: Support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask · 817820b0
      Benjamin Herrenschmidt committed
      This patch adds the ability to the DMA direct ops to fallback to the IOMMU
      ops for coherent alloc/free if the coherent mask of the device isn't
      suitable for accessing the direct DMA space and the device also happens
      to have an active IOMMU table.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • B
      powerpc/iommu: Cleanup setting of DMA base/offset · e91c2511
      Benjamin Herrenschmidt committed
      Now that the table and the offset can co-exist, we no longer need
      to flip/flop, we can just establish both once at boot time.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • B
      powerpc/iommu: Remove dma_data union · 2db4928b
      Benjamin Herrenschmidt committed
      To support "hybrid" DMA ops in a subsequent patch, we will need both
      a direct DMA offset and an iommu pointer. Those are currently exclusive
      (a union), so change them to be separate fields.
      
      While there, also type iommu_table_base properly and make it exist only
      on CONFIG_PPC64 since it's not referenced on 32-bit at all.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • R
      cxl: use more common format specifier · de369538
      Rasmus Villemoes committed
      A precision of 16 (%.16llx) has the same effect as a field width of 16
      along with passing the 0 flag (%016llx), but the latter is much more
      common in the kernel tree. Update cxl to use that.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
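The equivalence of the two specifiers can be checked in userspace with the standard C library. This is only a sketch: it uses `snprintf()` from libc, not the kernel's own printf implementation, and `same_hex_output()` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 when both format specifiers render 'v' identically. */
static int same_hex_output(unsigned long long v)
{
    char by_precision[32];
    char by_width[32];

    /* Precision of 16: at least 16 digits, padded with leading zeros. */
    snprintf(by_precision, sizeof(by_precision), "%.16llx", v);
    /* Field width of 16 with the 0 flag: padded to 16 columns with zeros. */
    snprintf(by_width, sizeof(by_width), "%016llx", v);

    return strcmp(by_precision, by_width) == 0;
}
```

Calling `same_hex_output()` with small, large and zero values is a quick way to see that the change is purely cosmetic for this conversion.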
    • R
      cxl: Add explicit precision specifiers · 80c394fa
      Rasmus Villemoes committed
      C99 says that a precision given as simply '.' with no following digits
      or * should be interpreted as 0. The kernel's printf implementation,
      however, treats this case as if the precision was omitted. C99 also
      says that if both the precision and value are 0, no digits should be
      printed. Even if the kernel followed C99 to the letter, I don't think
      that would be particularly useful in these cases. For consistency with
      most other format strings in the file, use an explicit precision of 16
      and add a 0x prefix.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
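The C99 behaviour described above can be observed with a conforming userspace libc. The sketch below uses `snprintf()` rather than the kernel's printf, and the helper names are hypothetical.

```c
#include <stdio.h>

/* Bare '.' means precision 0; with a zero value C99 prints no digits at all. */
static int len_bare_dot_zero(void)
{
    char buf[32];
    return snprintf(buf, sizeof(buf), "%.llx", 0ULL);
}

/* Explicit precision of 16 always prints 16 hex digits. */
static int len_precision_16_zero(void)
{
    char buf[32];
    return snprintf(buf, sizeof(buf), "%.16llx", 0ULL);
}

/* The form the commit settles on: a 0x prefix plus 16 digits. */
static int len_prefixed_zero(void)
{
    char buf[32];
    return snprintf(buf, sizeof(buf), "0x%.16llx", 0ULL);
}
```

The first helper yields an empty string for the value zero, which is why following C99 to the letter would not be useful for these messages.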
    • L
      Linux 4.2-rc2 · bc0195aa
      Linus Torvalds committed
    • L
      Revert "drm/i915: Use crtc_state->active in primary check_plane func" · 01e2d062
      Linus Torvalds committed
      This reverts commit dec4f799.
      
      Jörg Otte reports a NULL pointer dereference due to this commit, as
      'crtc_state' very much can be NULL:
      
              crtc_state = state->base.state ?
                      intel_atomic_get_crtc_state(state->base.state, intel_crtc) : NULL;
      
      So the change to test 'crtc_state->base.active' cannot possibly be
      correct as-is.
      
      There may be some other minimal fix (like just checking crtc_state for
      NULL), but I'm just reverting it now for the rc2 release, and people
      like Daniel Vetter who actually know this code will figure out what the
      right solution is in the longer term.
      Reported-and-bisected-by: Jörg Otte <jrg.otte@gmail.com>
      Cc: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · c83727a6
      Linus Torvalds committed
      Pull VFS fixes from Al Viro:
       "Fixes for this cycle regression in overlayfs and a couple of
        long-standing (== all the way back to 2.6.12, at least) bugs"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        freeing unlinked file indefinitely delayed
        fix a braino in ovl_d_select_inode()
        9p: don't leave a half-initialized inode sitting around
    • L
      Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 7fbb58a0
      Linus Torvalds committed
      Pull MIPS fixes from Ralf Baechle:
       "A fair number of 4.2 fixes also because Markos opened the flood gates.
      
         - Patch up the math used to calculate the location for the page bitmap.
      
         - The FDC (Not what you think, FDC stands for Fast Debug Channel) IRQ
           workaround was causing issues on non-Malta platforms, so move the code
           to a Malta specific location.
      
         - A spelling fix replicated through several files.
      
         - Fix to the emulation of an R2 instruction for R6 cores.
      
         - Fix the JR emulation for R6.
      
         - Further patching of mindless 64 bit issues.
      
         - Ensure the kernel won't crash on CPUs with L2 caches with >= 8
           ways.
      
         - Use compat_sys_getsockopt for O32 ABI on 64 bit kernels.
      
         - Fix cache flushing for multithreaded cores.
      
         - A build fix"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: O32: Use compat_sys_getsockopt.
        MIPS: c-r4k: Extend way_string array
        MIPS: Pistachio: Support CDMM & Fast Debug Channel
        MIPS: Malta: Make GIC FDC IRQ workaround Malta specific
        MIPS: c-r4k: Fix cache flushing for MT cores
        Revert "MIPS: Kconfig: Disable SMP/CPS for 64-bit"
        MIPS: cps-vec: Use macros for various arithmetics and memory operations
        MIPS: kernel: cps-vec: Replace KSEG0 with CKSEG0
        MIPS: kernel: cps-vec: Use ta0-ta3 pseudo-registers for 64-bit
        MIPS: kernel: cps-vec: Replace mips32r2 ISA level with mips64r2
        MIPS: kernel: cps-vec: Replace 'la' macro with PTR_LA
        MIPS: kernel: smp-cps: Fix 64-bit compatibility errors due to pointer casting
        MIPS: Fix erroneous JR emulation for MIPS R6
        MIPS: Fix branch emulation for BLTC and BGEC instructions
        MIPS: kernel: traps: Fix broken indentation
        MIPS: bootmem: Don't use memory holes for page bitmap
        MIPS: O32: Do not handle require 32 bytes from the stack to be readable.
        MIPS, CPUFREQ: Fix spelling of Institute.
        MIPS: Lemote 2F: Fix build caused by recent mass rename.
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1daa1cfb
      Linus Torvalds committed
      Pull x86 fixes from Thomas Gleixner:
      
       - the high latency PIT detection fix, which slipped through the cracks
         for rc1
      
       - a regression fix for the early printk mechanism
      
       - the x86 part to plug irq/vector related hotplug races
      
       - move the allocation of the espfix pages on cpu hotplug to non atomic
         context.  The current code triggers a might_sleep() warning.
      
       - a series of KASAN fixes addressing boot crashes and usability
      
       - a trivial typo fix for Kconfig help text
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/kconfig: Fix typo in the CONFIG_CMDLINE_BOOL help text
        x86/irq: Retrieve irq data after locking irq_desc
        x86/irq: Use proper locking in check_irq_vectors_for_cpu_disable()
        x86/irq: Plug irq vector hotplug race
        x86/earlyprintk: Allow early_printk() to use console style parameters like '115200n8'
        x86/espfix: Init espfix on the boot CPU side
        x86/espfix: Add 'cpu' parameter to init_espfix_ap()
        x86/kasan: Move KASAN_SHADOW_OFFSET to the arch Kconfig
        x86/kasan: Add message about KASAN being initialized
        x86/kasan: Fix boot crash on AMD processors
        x86/kasan: Flush TLBs after switching CR3
        x86/kasan: Fix KASAN shadow region page tables
        x86/init: Clear 'init_level4_pgt' earlier
        x86/tsc: Let high latency PIT fail fast in quick_pit_calibrate()
    • L
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7b732169
      Linus Torvalds committed
      Pull timer fixes from Thomas Gleixner:
       "This update from the timer department contains:
      
         - A series of patches which address a shortcoming in the tick
           broadcast code.
      
           If the broadcast device is not available or an hrtimer emulated
           broadcast device, some of the original assumptions lead to boot
           failures.  I rather plugged all of the corner cases instead of only
           addressing the issue reported, so the change got a little larger.
      
           Has been extensively tested on x86 and arm.
      
         - Get rid of the last holdouts using do_posix_clock_monotonic_gettime()
      
         - A regression fix for the imx clocksource driver
      
         - An update to the new state callbacks mechanism for clockevents.
           This is required to simplify the conversion, which will take place
           in 4.3"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        tick/broadcast: Prevent NULL pointer dereference
        time: Get rid of do_posix_clock_monotonic_gettime
        cris: Replace do_posix_clock_monotonic_gettime()
        tick/broadcast: Unbreak CONFIG_GENERIC_CLOCKEVENTS=n build
        tick/broadcast: Handle spurious interrupts gracefully
        tick/broadcast: Check for hrtimer broadcast active early
        tick/broadcast: Return busy when IPI is pending
        tick/broadcast: Return busy if periodic mode and hrtimer broadcast
        tick/broadcast: Move the check for periodic mode inside state handling
        tick/broadcast: Prevent deep idle if no broadcast device available
        tick/broadcast: Make idle check independent from mode and config
        tick/broadcast: Sanity check the shutdown of the local clock_event
        tick/broadcast: Prevent hrtimer recursion
        clockevents: Allow set-state callbacks to be optional
        clocksource/imx: Define clocksource for mx27
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c4bc680c
      Linus Torvalds committed
      Pull irq fix from Thomas Gleixner:
       "A single fix for a cpu hotplug race vs. interrupt descriptors:
      
        Prevent irq setup/teardown across the cpu starting/dying parts of cpu
        hotplug so that the starting/dying cpu has a stable view of the
        descriptor space.  This has been an issue for all architectures in the
        cpu dying phase, where interrupts are migrated away from the dying
        cpu.  In the starting phase it's mostly an x86 issue vs the vector space
        update"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        hotplug: Prevent alloc/free of irq descriptors during cpu up/down
  8. 12 July, 2015 2 commits
    • A
      freeing unlinked file indefinitely delayed · 75a6f82a
      Al Viro committed
      	Normally opening a file, unlinking it and then closing will have
      the inode freed upon close() (provided that it's not otherwise busy and
      has no remaining links, of course).  However, there's one case where that
      does *not* happen.  Namely, if you open it by fhandle with cold dcache,
      then unlink() and close().
      
      	In normal case you get d_delete() in unlink(2) notice that dentry
      is busy and unhash it; on the final dput() it will be forcibly evicted from
      dcache, triggering iput() and inode removal.  In this case, though, we end
      up with *two* dentries - disconnected (created by open-by-fhandle) and
      regular one (used by unlink()).  The latter will have its reference to inode
      dropped just fine, but the former will not - it's considered hashed (it
      is on the ->s_anon list), so it will stay around until the memory pressure
      will finally do it in.  As the result, we have the final iput() delayed
      indefinitely.  It's trivial to reproduce -
      
      #define _GNU_SOURCE	/* for name_to_handle_at() and open_by_handle_at() */
      #include <fcntl.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/stat.h>
      
      /* Remount to get the cached dentries evicted. */
      void flush_dcache(void)
      {
              system("mount -o remount,rw /");
      }
      
      static char buf[20 * 1024 * 1024];
      
      int main(void)
      {
              int fd;
              union {
                      struct file_handle f;
                      char buf[MAX_HANDLE_SZ];
              } x;
              int m;
      
              x.f.handle_bytes = sizeof(x);
              chdir("/root");
              mkdir("foo", 0700);
              fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
              close(fd);
              name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
              flush_dcache();
              fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
              unlink("foo/bar");
              write(fd, buf, sizeof(buf));
              system("df .");			/* 20Mb eaten */
              close(fd);
              system("df .");			/* should've freed those 20Mb */
              flush_dcache();
              system("df .");			/* should be the same as #2 */
              return 0;
      }
      
      will spit out something like
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 283282     21692  93% /
      - inode gets freed only when dentry is finally evicted (here we trigger
      that by remount; normally it would've happened in response to memory
      pressure hell knows when).
      
      Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
      Acked-by: J. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • A
      fix a braino in ovl_d_select_inode() · 9391dd00
      Al Viro committed
      when opening a directory we want the overlayfs inode, not one from
      the topmost layer.
      Reported-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
      Tested-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>