1. 10 Jan, 2013 21 commits
    • T
      block: RCU free request_queue · 548bc8e1
      Tejun Heo committed
      RCU free request_queue so that blkcg_gq->q can be dereferenced under
      RCU lock.  This will be used to implement hierarchical stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() · 16b3de66
      Tejun Heo committed
      Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
      The former two collect the [rw]stats designated by the target policy
      data and offset from the pd's subtree.  The latter two add one
      [rw]stat to another.
      
      Note that the recursive sum functions require the queue lock to be
      held on entry to make blkg online test reliable.  This is necessary to
      properly handle stats of a dying blkg.
      
      These will be used to implement hierarchical stats.
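The recursive sum and merge described above can be sketched in plain C. This is only an illustration under stated assumptions: `struct pd`, `stat_read()`, `stat_recursive_sum()` and `stat_merge()` are hypothetical stand-ins, not the kernel's blkcg types or functions, the caller is assumed to hold whatever lock makes the `online` flag stable, and an offline node is skipped together with its subtree for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

/* Hypothetical per-group policy data; the kernel's blkg_policy_data differs. */
struct pd {
	bool online;                       /* group is still active */
	uint64_t ios;                      /* example counter */
	uint64_t bytes;                    /* example counter */
	size_t nr_children;
	struct pd *children[MAX_CHILDREN];
};

/* Read the uint64_t counter located 'off' bytes into the pd. */
static uint64_t stat_read(const struct pd *pd, size_t off)
{
	return *(const uint64_t *)((const char *)pd + off);
}

/*
 * Sum the counter at 'off' over pd and its online descendants.
 * Assumption: the caller holds a lock that keeps 'online' reliable.
 */
static uint64_t stat_recursive_sum(const struct pd *pd, size_t off)
{
	assert(pd != NULL);
	uint64_t sum = stat_read(pd, off);

	for (size_t i = 0; i < pd->nr_children; i++)
		if (pd->children[i]->online)
			sum += stat_recursive_sum(pd->children[i], off);
	return sum;
}

/* Add one counter into another. */
static void stat_merge(uint64_t *to, const uint64_t *from)
{
	*to += *from;
}
```

Selecting the counter by byte offset, for example `offsetof(struct pd, ios)`, lets one walker serve every counter in the structure.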
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      blkcg: export __blkg_prfill_rwstat() · b50da39f
      Tejun Heo committed
      Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
      Export it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
    • T
      blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/ · 4d5e80a7
      Tejun Heo committed
      Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
      summing up stats from multiple blkgs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online · f427d909
      Tejun Heo committed
      Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
      which are invoked as the policy_data gets activated and deactivated
      while holding both blkcg and q locks.
      
      Also, add blkcg_gq->online bool, which is set and cleared as the
      blkcg_gq gets activated and deactivated.  This flag also is toggled
      while holding both blkcg and q locks.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      blkcg: add blkg_policy_data->plid · b276a876
      Tejun Heo committed
      Add pd->plid so that the policy a pd belongs to can be identified
      easily.  This will be used to implement hierarchical blkg_[rw]stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      cfq-iosched: enable full blkcg hierarchy support · d02f7aa8
      Tejun Heo committed
      With the previous two patches, all cfqg scheduling decisions are based
      on vfraction and ready for hierarchy support.  The only thing which
      keeps the behavior flat is cfqg_flat_parent() which makes vfraction
      calculation consider all non-root cfqgs children of the root cfqg.
      
      Replace it with cfqg_parent() which returns the real parent.  This
      enables full blkcg hierarchy support for cfq-iosched.  For example,
      consider the following hierarchy.
      
              root
            /      \
         A:500      B:250
        /     \
       AA:500  AB:1000
      
      For simplicity, let's say all the leaf nodes have active tasks and are
      on service tree.  For each leaf node, vfraction would be
      
       AA: (500  / 1500) * (500 / 750) =~ 0.2222
       AB: (1000 / 1500) * (500 / 750) =~ 0.4444
        B:                 (250 / 750) =~ 0.3333
      
      and vdisktime will be distributed accordingly.  For more detail,
      please refer to Documentation/block/cfq-iosched.txt.
      
      v2: cfq-iosched.txt updated to describe group scheduling as suggested
          by Vivek.
      
      v3: blkio-controller.txt updated.
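The vfraction arithmetic in the example hierarchy can be reproduced with a short sketch. It uses floating point for readability, whereas the commit series describes a fixed-point representation; `struct group` and `group_vfraction()` are hypothetical names for illustration and do not mirror the kernel's `struct cfq_group`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cfq group. */
struct group {
	const struct group *parent;   /* NULL for the root */
	unsigned int weight;          /* weight against siblings */
	unsigned int children_weight; /* sum of the active children's weights */
};

/*
 * Walk from the group up to the root, compounding
 * weight / parent->children_weight at each level.
 */
static double group_vfraction(const struct group *g)
{
	double frac = 1.0;

	for (; g->parent != NULL; g = g->parent) {
		assert(g->parent->children_weight > 0);
		frac *= (double)g->weight / g->parent->children_weight;
	}
	return frac;
}
```

With root (children weight 750), A (weight 500, children weight 1500), B (weight 250), AA (weight 500) and AB (weight 1000), the function yields 2/9, 4/9 and 1/3 for AA, AB and B, matching the figures in the message.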
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction · 41cad6ab
      Tejun Heo committed
      cfq_group_slice() calculates slice by taking a fraction of
      cfq_target_latency according to the ratio of cfqg->weight against
      service_tree->total_weight.  This currently works only because all
      cfqgs are treated to be at the same level.
      
      To prepare for proper hierarchy support, convert cfq_group_slice() to
      base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
      a fraction of 1 and represents the fraction allocated to the cfqg with
      hierarchy considered, the slice can be simply calculated by
      multiplying cfqg->vfraction to cfq_target_latency (with fixed point
      shift factored in).
      
      As vfraction calculation currently treats all non-root cfqgs as
      children of the root cfqg, this patch doesn't introduce noticeable
      behavior difference.
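A minimal sketch of the slice calculation described above, assuming a fixed-point vfraction where `1 << SERVICE_SHIFT` represents the whole device. `SERVICE_SHIFT` (set to 12 here) stands in for CFQ_SERVICE_SHIFT and `group_slice()` is a hypothetical name; both are assumptions for illustration rather than the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift, standing in for CFQ_SERVICE_SHIFT. */
#define SERVICE_SHIFT 12

/*
 * Slice = target_latency * vfraction, with vfraction expressed in
 * fixed point so that (1 << SERVICE_SHIFT) means a fraction of 1.
 */
static unsigned int group_slice(unsigned int target_latency,
				unsigned int vfraction)
{
	assert(vfraction <= (1u << SERVICE_SHIFT));
	return (unsigned int)(((uint64_t)target_latency * vfraction)
			      >> SERVICE_SHIFT);
}
```

A group entitled to half of the device (vfraction `1 << 11`) gets half of the target latency, regardless of how many hierarchy levels produced that fraction.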
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      cfq-iosched: implement hierarchy-ready cfq_group charge scaling · 1d3650f7
      Tejun Heo committed
      Currently, cfqg charges are scaled directly according to cfqg->weight.
      Regardless of the number of active cfqgs or the amount of active
      weights, a given weight value always scales charge the same way.  This
      works fine as long as all cfqgs are treated equally regardless of
      their positions in the hierarchy, which is what cfq currently
      implements.  It can't work in hierarchical settings because the
      interpretation of a given weight value depends on where the weight is
      located in the hierarchy.
      
      This patch reimplements cfqg charge scaling so that it can be used to
      support hierarchy properly.  The scheme is fairly simple and
      light-weight.
      
      * When a cfqg is added to the service tree, v(disktime)weight is
        calculated.  It walks up the tree to root calculating the fraction
        it has in the hierarchy.  At each level, the fraction can be
        calculated as
      
          cfqg->weight / parent->level_weight
      
        By compounding these, the global fraction of vdisktime the cfqg has
        claim to - vfraction - can be determined.
      
      * When the cfqg needs to be charged, the charge is scaled inversely
        proportionally to the vfraction.
      
      The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
      representation as before; however, the smallest scaling factor is now
      1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before where 1
      was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
      scaling factor.
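The inverse scaling can be sketched as follows. This is an illustration under assumptions: `SERVICE_SHIFT` (12 here) stands in for CFQ_SERVICE_SHIFT, `scale_charge()` is a hypothetical name, and vfraction is taken to be a fixed-point value in the range (0, 1 << SERVICE_SHIFT].

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift, standing in for CFQ_SERVICE_SHIFT. */
#define SERVICE_SHIFT 12

/*
 * Scale a charge inversely to the group's vfraction.  A group owning
 * the whole device (vfraction == 1 << SERVICE_SHIFT) gets the smallest
 * factor, 1 in fixed point; smaller fractions are charged more.
 */
static uint64_t scale_charge(uint64_t charge, unsigned int vfraction)
{
	assert(vfraction > 0 && vfraction <= (1u << SERVICE_SHIFT));

	uint64_t c = charge << SERVICE_SHIFT; /* charge in fixed point */
	c <<= SERVICE_SHIFT;                  /* vfraction is fixed point too */
	return c / vfraction;
}
```

For a charge of 100, a vfraction of 1 gives `100 << 12`, and a vfraction of 1/2 gives twice that, so virtual disk time advances twice as fast for the smaller group.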
      
      While this shifts the global scale of vdisktime a bit, it doesn't
      change the relative relationships among cfqgs and the scheduling
      result isn't different.
      
      cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
      new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
      didn't have any relevance to vdisktime before and is unlikely to cause
      any visible behavior difference now especially as the scale shift
      isn't that large.
      
      As the new scheme now makes proper distinction between cfqg->weight
      and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
      root, both weights are now mapped to ->leaf_weight instead of the
      other way around.
      
      Because we're still using cfqg_flat_parent(), this patch shouldn't
      change the scheduling behavior in any noticeable way.
      
      v2: Beefed up comments on vfraction as requested by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      cfq-iosched: implement cfq_group->nr_active and ->children_weight · 7918ffb5
      Tejun Heo committed
      To prepare for blkcg hierarchy support, add cfqg->nr_active and
      ->children_weight.  cfqg->nr_active counts the number of active cfqgs
      at the cfqg's level and ->children_weight is sum of weights of those
      cfqgs.  The level covers itself (cfqg->leaf_weight) and immediate
      children.
      
      The two values are updated when a cfqg enters and leaves the group
      service tree.  Unless the hierarchy is very deep, the added overhead
      should be negligible.
      
      Currently, the parent is determined using cfqg_flat_parent() which
      makes the root cfqg the parent of all other cfqgs.  This is to make
      the transition to hierarchy-aware scheduling gradual.  Scheduling
      logic will be converted to use cfqg->children_weight without actually
      changing the behavior.  When everything is ready,
      cfqg_flat_parent() will be replaced with the proper parent function.
      
      This patch doesn't introduce any behavior change.
      
      v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      cfq-iosched: add leaf_weight · e71357e1
      Tejun Heo committed
      cfq blkcg is about to grow proper hierarchy handling, where a child
      blkg's weight would nest inside the parent's.  This makes tasks in a
      blkg compete against both the tasks in the sibling blkgs and the tasks
      of child blkgs.
      
      We're gonna use the existing weight as the group weight which decides
      the blkg's weight against its siblings.  This patch introduces a new
      weight - leaf_weight - which decides the weight of a blkg against the
      child blkgs.
      
      It's named leaf_weight because another way to look at it is that each
      internal blkg node has a hidden child leaf node which contains all
      its tasks; leaf_weight is the weight of that leaf node and is handled
      the same as the weight of the child blkgs.
      
      This patch only adds leaf_weight fields and exposes it to userland.
      The new weight isn't actually used anywhere yet.  Note that
      cfq-iosched currently officially supports only single level hierarchy
      and root blkgs compete with the first level blkgs - ie. root weight is
      basically being used as leaf_weight.  For root blkgs, the two weights
      are kept in sync for backward compatibility.
      
      v2: cfqd->root_group->leaf_weight initialization was missing from
          cfq_init_queue() causing divide by zero when
          !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
    • T
      blkcg: make blkcg_gq's hierarchical · 3c547865
      Tejun Heo committed
      Currently a child blkg (blkcg_gq) can be created even if its parent
      doesn't exist.  ie. Given a blkg, it's not guaranteed that its
      ancestors will exist.  This makes it difficult to implement proper
      hierarchy support for blkcg policies.
      
      Always create blkgs recursively and make a child blkg hold a reference
      to its parent.  blkg->parent is added so that finding the parent is
      easy.  blkcg_parent() is also added in the process.
      
      This change can be visible to userland.  e.g. while issuing IO in a
      nested cgroup didn't affect the ancestors at all, now it will
      initialize all ancestor blkgs and zero stats for the request_queue
      will always appear on them.  While this is userland visible, this
      shouldn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      blkcg: cosmetic updates to blkg_create() · 93e6d5d8
      Tejun Heo committed
      * Rename out_* labels to err_*.
      
      * Do ERR_PTR() conversion once in the error return path.
      
      This patch is cosmetic and to prepare for the hierarchy support.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      blkcg: reorganize blkg_lookup_create() and friends · 86cde6b6
      Tejun Heo committed
      Reorganize such that
      
      * __blkg_lookup() takes bool param @update_hint to determine whether
        to update hint.
      
      * __blkg_lookup_create() no longer performs lookup before trying to
        create.  Renamed to blkg_create().
      
      * blkg_lookup_create() now performs lookup and then invokes
        blkg_create() if lookup fails.
      
      * root_blkg creation in blkcg_activate_policy() updated accordingly.
        Note that blkcg_activate_policy() no longer updates lookup hint if
        root_blkg already exists.
      
      Except for the last lookup hint bit which is immaterial, this is pure
      reorganization and doesn't introduce any visible behavior change.
      This is to prepare for proper hierarchy support.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • T
      blkcg: fix minor bug in blkg_alloc() · 356d2e58
      Tejun Heo committed
      blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
      The latter test should have been on whether pol->pd_init_fn() exists.
      This doesn't cause actual problems because both blkcg policies
      implement pol->pd_init_fn().  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • V
      cfq-iosched: Print sync-noidle information in blktrace messages · b226e5c4
      Vivek Goyal committed
      Currently we attach a character "S" or "A" to the cfqq<pid> to represent
      whether the queue is sync or async. Add one more character, "N", to represent
      whether it is a sync-noidle queue or a sync queue. So now the three different
      types of queues will look as follows.
      
      cfq1234S   --> sync queue
      cfq1234SN  --> sync noidle queue
      cfq1234A   --> Async queue
      
      Previously S/A classification was being printed only if group scheduling
      was enabled. This patch also makes sure that this classification is
      displayed even if group idling is disabled.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • V
      cfq-iosched: Get rid of unnecessary local variable · 1f23f121
      Vivek Goyal committed
      Use of local variable "n" seems to be unnecessary. Remove it. This brings
      it in line with function __cfq_group_st_add(), which does a similar
      operation of adding a group to an rb tree.
      
      No functionality change here.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • V
      cfq-iosched: Rename few functions related to selecting workload · 6d816ec7
      Vivek Goyal committed
      choose_service_tree() selects/sets both wl_class and wl_type.  Rename it to
      choose_wl_class_and_type() to make it very clear.
      
      cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
      it with choose_st(). So rename it to cfq_choose_wl_type() to make
      it clear what it does.
      
      Just renaming. No functionality change.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • V
      cfq-iosched: Rename "service_tree" to "st" at some places · 34b98d03
      Vivek Goyal committed
      In quite a few places we use the keyword "service_tree". In some places,
      especially local variables, I have abbreviated it to "st".
      
      Also, in a couple of places moved binary operator "+" from beginning of line
      to end of previous line, as per Tejun's feedback.
      
      v2:
       Reverted most of the service tree name change based on Jeff Moyer's feedback.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • V
      cfq-iosched: More renaming to better represent wl_class and wl_type · 4d2ceea4
      Vivek Goyal committed
      Some more renaming. Again making the code uniform w.r.t use of
      wl_class/class to represent IO class (RT, BE, IDLE) and using
      wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC).
      
      At places this patch shortens the string "workload" to "wl".
      Renamed "saved_workload" to "saved_wl_type". Renamed
      "saved_serving_class" to "saved_wl_class".
      
      For uniformity with "saved_wl_*" variables, renamed "serving_class"
      to "serving_wl_class" and renamed "serving_type" to "serving_wl_type".
      
      Again, just trying to improve upon code uniformity and improve
      readability. No functional change.
      
      v2:
      - Restored the usage of keyword "service" based on Jeff Moyer's feedback.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • V
      cfq-iosched: Properly name all references to IO class · 3bf10fea
      Vivek Goyal committed
      Currently CFQ has three IO classes, RT, BE and IDLE. In many places we
      are calling workloads belonging to these classes "prio". This gets
      very confusing as one starts to associate it with ioprio.
      
      So this patch just does a bunch of renaming so that reading code becomes
      easier. All references to RT, BE and IDLE workloads are done using keyword
      "class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made
      using keyword "type".
      
      This makes me feel much better while I am reading the code. There is no
      functionality change due to this patch.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. 03 Jan, 2013 14 commits
    • L
      Linux 3.8-rc2 · d1c3ed66
      Linus Torvalds committed
    • L
      Merge branch 'fixes-for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds · d50403dc
      Linus Torvalds committed
      Pull LED fix from Bryan Wu.
      
      * 'fixes-for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds:
        leds: leds-gpio: set devm_gpio_request_one() flags param correctly
    • J
      leds: leds-gpio: set devm_gpio_request_one() flags param correctly · 2d7c22f6
      Javier Martinez Canillas committed
      commit a99d76f9 leds: leds-gpio: use gpio_request_one
      
      changed the leds-gpio driver to use gpio_request_one() instead
      of gpio_request() + gpio_direction_output()
      
      Unfortunately, it also made a semantic change that breaks the
      leds-gpio driver.
      
      The gpio_request_one() flags parameter was set to:
      
      GPIOF_DIR_OUT | (led_dat->active_low ^ state)
      
      Since GPIOF_DIR_OUT is 0, the final flags value will just be the
      XOR'ed value of led_dat->active_low and state.
      
      This value was used to distinguish between HIGH/LOW output initial
      level and call gpio_direction_output() accordingly.
      
      With this new semantic gpio_request_one() will take the flags value
      of 1 as a configuration of input direction (GPIOF_DIR_IN) and will
      call gpio_direction_input() instead of gpio_direction_output().
      
      int gpio_request_one(unsigned gpio, unsigned long flags, const char *label)
      {
      ..
      	if (flags & GPIOF_DIR_IN)
      		err = gpio_direction_input(gpio);
      	else
      		err = gpio_direction_output(gpio,
      				(flags & GPIOF_INIT_HIGH) ? 1 : 0);
      ..
      }
      
      The right semantic is to evaluate led_dat->active_low ^ state and
      set the output initial level explicitly.
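The failure mode can be reproduced outside the kernel with a small model. The `F_*` constants, `decode_flags()` and the two `led_flags_*()` helpers are hypothetical stand-ins; the values for the direction bit follow the description above, while placing the initial-level flag in bit 1 is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the GPIOF_* flags (assumed bit layout). */
#define F_DIR_OUT   (0u << 0)
#define F_DIR_IN    (1u << 0)
#define F_INIT_LOW  (0u << 1)
#define F_INIT_HIGH (1u << 1)

enum gpio_setup { SETUP_INPUT, SETUP_OUTPUT_LOW, SETUP_OUTPUT_HIGH };

/* Decode flags the way the gpio_request_one() excerpt does. */
static enum gpio_setup decode_flags(unsigned long flags)
{
	if (flags & F_DIR_IN)
		return SETUP_INPUT;
	return (flags & F_INIT_HIGH) ? SETUP_OUTPUT_HIGH : SETUP_OUTPUT_LOW;
}

/* Broken: the XOR result lands in bit 0, which is the direction bit. */
static unsigned long led_flags_broken(bool active_low, bool state)
{
	return F_DIR_OUT | (unsigned long)(active_low ^ state);
}

/* Fixed: evaluate the XOR first, then choose the initial level explicitly. */
static unsigned long led_flags_fixed(bool active_low, bool state)
{
	return F_DIR_OUT | ((active_low ^ state) ? F_INIT_HIGH : F_INIT_LOW);
}

/* Self-check for the failing case: the XOR evaluates to 1. */
static void led_flags_selfcheck(void)
{
	assert(decode_flags(led_flags_broken(false, true)) == SETUP_INPUT);
	assert(decode_flags(led_flags_fixed(false, true)) == SETUP_OUTPUT_HIGH);
}
```

When the XOR evaluates to 0 both variants configure a low output, which is why the regression only shows up for LEDs whose initial level should be high.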
      Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk>
      Reported-by: Arnaud Patard <arnaud.patard@rtp-net.org>
      Tested-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: Bryan Wu <cooloney@gmail.com>
    • L
      Merge git://www.linux-watchdog.org/linux-watchdog · ef05e9b9
      Linus Torvalds committed
      Pull watchdog fixes from Wim Van Sebroeck:
       "This fixes some small errors in the new da9055 driver, eliminates a
        compiler warning and adds DT support for the twl4030_wdt driver (so
        that we can have multiple watchdogs with DT on the omap platforms)."
      
      * git://www.linux-watchdog.org/linux-watchdog:
        watchdog: twl4030_wdt: add DT support
        watchdog: omap_wdt: eliminate unused variable and a compiler warning
        watchdog: da9055: Don't update wdt_dev->timeout in da9055_wdt_set_timeout error path
        watchdog: da9055: Fix invalid free of devm_ allocated data
    • L
      Merge tag '3.8-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 080a62e2
      Linus Torvalds committed
      Pull PCI updates from Bjorn Helgaas:
       "Some fixes for v3.8.  They include a fix for the new SR-IOV sysfs
        management support, an expanded quirk for Ricoh SD card readers, a
        Stratus DMI quirk fix, and a PME polling fix."
      
      * tag '3.8-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: Reduce Ricoh 0xe822 SD card reader base clock frequency to 50MHz
        PCI/PM: Do not suspend port if any subordinate device needs PME polling
        PCI: Add PCIe Link Capability link speed and width names
        PCI: Work around Stratus ftServer broken PCIe hierarchy (fix DMI check)
        PCI: Remove spurious error for sriov_numvfs store and simplify flow
    • D
      UAPI: Strip _UAPI prefix on header install no matter the whitespace · 8a7eab2b
      David Howells committed
      Commit 56c176c9 ("UAPI: strip the _UAPI prefix from header guards
      during header installation") strips the _UAPI prefix from header guards,
      but only if there's a single space between the cpp directive and the
      label.
      
      Make it more flexible and able to handle tabs and multiple white space
      characters.
      Signed-off-by: David Howells <dhowell@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      UAPI: Remove empty Kbuild files · 3d33fcc1
      David Howells committed
      Empty files can get deleted by the patch program, so remove empty Kbuild
      files and their links from the parent Kbuilds.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge tag 'ecryptfs-3.8-rc2-fixes' of... · 007f6c3a
      Linus Torvalds committed
      Merge tag 'ecryptfs-3.8-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
      
      Pull ecryptfs fixes from Tyler Hicks:
       "Two self-explanatory fixes and a third patch which improves
        performance: when overwriting a full page in the eCryptfs page cache,
        skip reading in and decrypting the corresponding lower page."
      
      * tag 'ecryptfs-3.8-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
        fs/ecryptfs/crypto.c: make ecryptfs_encode_for_filename() static
        eCryptfs: fix to use list_for_each_entry_safe() when delete items
        eCryptfs: Avoid unnecessary disk read and data decryption during writing
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · 58890c06
      Linus Torvalds committed
      Pull Ceph fixes from Sage Weil:
       "Two of Alex's patches deal with a race when resetting server
        connections for open RBD images, one demotes some non-fatal BUGs to
        WARNs, and my patch fixes a protocol feature bit failure path."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
        libceph: fix protocol feature mismatch failure path
        libceph: WARN, don't BUG on unexpected connection states
        libceph: always reset osds when kicking
        libceph: move linger requests sooner in kick_requests()
    • M
      mm: mempolicy: Convert shared_policy mutex to spinlock · 42288fe3
      Mel Gorman committed
      Sasha was fuzzing with trinity and reported the following problem:
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:269
        in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
        2 locks held by trinity-main/6361:
         #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
         #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
        Pid: 6361, comm: trinity-main Tainted: G        W
        3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
        Call Trace:
          __might_sleep+0x1c3/0x1e0
          mutex_lock_nested+0x29/0x50
          mpol_shared_policy_lookup+0x2e/0x90
          shmem_get_policy+0x2e/0x30
          get_vma_policy+0x5a/0xa0
          mpol_misplaced+0x41/0x1d0
          handle_pte_fault+0x465/0x6a0
      
      This was triggered by a different version of automatic NUMA balancing
      but in theory the current version is vulnerable to the same problem.
      
      do_numa_page
        -> numa_migrate_prep
          -> mpol_misplaced
            -> get_vma_policy
              -> shmem_get_policy
      
      It's very unlikely this will happen as shared pages are not marked
      pte_numa -- see the page_mapcount() check in change_pte_range() -- but
      it is possible.
      
      To address this, this patch restores sp->lock as originally implemented
      by Kosaki Motohiro.  In the path where get_vma_policy() is called, it
      should not be calling sp_alloc() so it is not necessary to treat the PTL
      specially.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 5439ca6b
      Linus Torvalds committed
      Pull ext4 bug fixes from Ted Ts'o:
       "Various bug fixes for ext4.  Perhaps the most serious bug fixed is one
        which could cause file system corruptions when performing file punch
        operations."
      
      * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: avoid hang when mounting non-journal filesystems with orphan list
        ext4: lock i_mutex when truncating orphan inodes
        ext4: do not try to write superblock on ro remount w/o journal
        ext4: include journal blocks in df overhead calcs
        ext4: remove unaligned AIO warning printk
        ext4: fix an incorrect comment about i_mutex
        ext4: fix deadlock in journal_unmap_buffer()
        ext4: split off ext4_journalled_invalidatepage()
        jbd2: fix assertion failure in jbd2_journal_flush()
        ext4: check dioread_nolock on remount
        ext4: fix extent tree corruption caused by hole punch
    • H
      mempolicy: remove arg from mpol_parse_str, mpol_to_str · a7a88b23
      Hugh Dickins committed
      Remove the unused argument (formerly no_context) from mpol_parse_str()
      and from mpol_to_str().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      tmpfs mempolicy: fix /proc/mounts corrupting memory · f2a07f40
      Hugh Dickins committed
      Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
      mempolicy testing.  Very nasty.  Reading /proc/mounts, /proc/pid/mounts
      or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
      in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
      pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
      worse.  "mpol=prefer" and "mpol=prefer:Node" are equally toxic.
      
      Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
      when commit e17f74af "mempolicy: don't call mpol_set_nodemask() when
      no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
      which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
      With slab poisoning, you can then rely on mpol_to_str() to set the bit
      for node 0x6b6b, probably in the next page above the caller's stack.
      
      mpol_parse_str() is only called from shmem_parse_options(): no_context
      is always true, so call it unused for now, and remove !no_context code.
      Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
      expect.  Then mpol_to_str() can ignore its no_context argument also,
      the mpol being appropriately initialized whether contextualized or not.
      Rename its no_context unused too, and let subsequent patch remove them
      (that's not needed for stable backporting, which would involve rejects).
      
      I don't understand why MPOL_LOCAL is described as a pseudo-policy:
      it's a reasonable policy which suffers from a confusing implementation
      in terms of MPOL_PREFERRED with MPOL_F_LOCAL.  I believe this would be
      much more robust if MPOL_LOCAL were recognized in switch statements
      throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
      empty) nodes mask like everyone else, instead of its preferred_node
      variant (I presume an optimization from the days before MPOL_LOCAL).
      But that would take me too long to get right and fully tested.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • E
      epoll: prevent missed events on EPOLL_CTL_MOD · 128dd175
      Eric Wong committed
      EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
      ensure events are not missed.  Since the modifications to the interest
      mask are not protected by the same lock as ep_poll_callback, we need to
      ensure the change is visible to other CPUs calling ep_poll_callback.
      
      We also need to ensure f_op->poll() has an up-to-date view of past
      events which occurred before we modified the interest mask.  So this
      barrier also pairs with the barrier in wq_has_sleeper().
      
      This should guarantee either ep_poll_callback or f_op->poll() (or both)
      will notice the readiness of a recently-ready/modified item.
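The "either side (or both) will notice" guarantee is the classic store, full barrier, load pairing. The following userspace analogy uses C11 atomics; it is not the kernel code, and `interest_mask`, `pending_events`, `modify_interest()` and `event_callback()` are hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint interest_mask;  /* events the waiter cares about */
static atomic_uint pending_events; /* events raised so far */

/* Modifier side: publish the new mask, full barrier, then poll. */
static unsigned int modify_interest(unsigned int new_mask)
{
	atomic_store_explicit(&interest_mask, new_mask, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&pending_events, memory_order_relaxed)
	       & new_mask;
}

/* Callback side: record the event, full barrier, then read the mask. */
static bool event_callback(unsigned int event)
{
	assert(event != 0);
	atomic_fetch_or_explicit(&pending_events, event, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&interest_mask, memory_order_relaxed)
		& event) != 0;
}
```

Because each side stores before its fence and loads after it, at least one of two racing calls observes the other's store: either the poll sees the pending event or the callback sees the new mask. Without the fences both could read stale values and the event would be missed.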
      
      This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
      http://thread.gmane.org/gmane.linux.kernel/1408782/
      Signed-off-by: Eric Wong <normalperson@yhbt.net>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Hans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Voellmy <andreas.voellmy@yale.edu>
      Tested-by: "Junchang(Jason) Wang" <junchang.wang@yale.edu>
      Cc: netdev@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 Jan, 2013 4 commits
  4. 31 Dec, 2012 1 commit
    • L
      Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux · 4a490b78
      Linus Torvalds committed
      Pull DRM update from Dave Airlie:
       "This is a bit larger due to me not bothering to do anything since
        before Xmas, and other people working too hard after I had clearly
        given up.
      
        It's got the 3 main x86 driver fixes pulls, and a bunch of tegra
        fixes, doesn't fix the Ironlake bug yet, but that does seem to be
        getting closer.
      
         - radeon: gpu reset fixes and userspace packet support
         - i915: watermark fixes, workarounds, i830/845 fix,
         - nouveau: nvd9/kepler microcode fixes, accel is now enabled and
           working, gk106 support
         - tegra: misc fixes."
      
      * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (34 commits)
        Revert "drm: tegra: protect DC register access with mutex"
        drm: tegra: program only one window during modeset
        drm: tegra: clean out old gem prototypes
        drm: tegra: remove redundant tegra2_tmds_config entry
        drm: tegra: protect DC register access with mutex
        drm: tegra: don't leave clients host1x member uninitialized
        drm: tegra: fix front_porch <-> back_porch mixup
        drm/nve0/graph: fix fuc, and enable acceleration on all known chipsets
        drm/nvc0/graph: fix fuc, and enable acceleration on GF119
        drm/nouveau/bios: cache ramcfg strap on later chipsets
        drm/nouveau/mxm: silence output if no bios data
        drm/nouveau/bios: parse/display extra version component
        drm/nouveau/bios: implement opcode 0xa9
        drm/nouveau/bios: update gpio parsing apis to match current design
        drm/nouveau: initial support for GK106
        drm/radeon: add WAIT_UNTIL to evergreen VM safe reg list
        drm/i915: disable shrinker lock stealing for create_mmap_offset
        drm/i915: optionally disable shrinker lock stealing
        drm/i915: fix flags in dma buf exporting
        drm/radeon: add support for MEM_WRITE packet
        ...