  1. 14 Jan, 2009 1 commit
  2. 09 Jan, 2009 1 commit
    • memcg: synchronized LRU · 08e552c6
      Committed by KAMEZAWA Hiroyuki
      A big patch that changes memcg's LRU semantics.
      
      Now,
        - page_cgroup is linked to its mem_cgroup's own LRU (per zone).
      
        - The LRU of page_cgroup is not synchronized with the global LRU.
      
        - page and page_cgroup are one-to-one and statically allocated.
      
        - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup, as in
          - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
      
        - SwapCache is handled.
      
      And, when we handle the LRU list of page_cgroup, we do the following.
      
      	pc = lookup_page_cgroup(page);
      	lock_page_cgroup(pc); .....................(1)
      	mz = page_cgroup_zoneinfo(pc);
      	spin_lock(&mz->lru_lock);
      	.....add to LRU
      	spin_unlock(&mz->lru_lock);
      	unlock_page_cgroup(pc);
      
      But (1) is a spin_lock, so we have to worry about deadlock with zone->lru_lock.
      So trylock() is currently used at (1). Without (1), we can't trust that "mz" is correct.
      
      This is an attempt to remove this dirty nesting of locks.
      This patch changes mz->lru_lock to be zone->lru_lock.
      Then the above sequence can be written as
      
      	spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      	mem_cgroup_add/remove/etc_lru() {
      		pc = lookup_page_cgroup(page);
      		mz = page_cgroup_zoneinfo(pc);
      		if (PageCgroupUsed(pc)) {
      			....add to LRU
      		}
      	}
      	spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      
      This is much simpler.
      (*) We're safe even if we don't take lock_page_cgroup(pc), because:
          1. pc->mem_cgroup can be modified only
             - at charge.
             - at account_move().
          2. At charge,
             the PCG_USED bit is not set before pc->mem_cgroup is fixed.
          3. At account_move(),
             the page is isolated and not on the LRU.
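
The argument in (*) can be illustrated with a small userspace model. This is a minimal sketch, not the kernel code: the structs and the helpers zone_lru_lock(), charge(), memcg_add_lru() and global_lru_add() are simplified stand-ins invented for illustration, the lock is a plain flag, and memory ordering on real SMP hardware is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct mem_cgroup { int lru_count; };

struct page_cgroup {
    struct mem_cgroup *mem_cgroup; /* owner; fixed before `used` is set */
    bool used;                     /* models the PCG_USED bit */
    bool on_lru;
};

struct zone { bool lru_locked; };  /* models zone->lru_lock as a flag */

static void zone_lru_lock(struct zone *z)
{
    assert(!z->lru_locked);
    z->lru_locked = true;
}

static void zone_lru_unlock(struct zone *z)
{
    assert(z->lru_locked);
    z->lru_locked = false;
}

/* Charge: fix the owner first, and only then mark the page_cgroup used. */
static void charge(struct page_cgroup *pc, struct mem_cgroup *memcg)
{
    pc->mem_cgroup = memcg;
    pc->used = true;
}

/*
 * Models mem_cgroup_add_lru(): the caller already holds the zone LRU lock,
 * so no per-page_cgroup lock is taken. Returns true if pc was linked.
 */
static bool memcg_add_lru(struct zone *z, struct page_cgroup *pc)
{
    assert(z->lru_locked);        /* contract: called under zone->lru_lock */
    if (!pc->used || pc->on_lru)  /* not charged yet, or already linked */
        return false;
    pc->mem_cgroup->lru_count++;
    pc->on_lru = true;
    return true;
}

/* Models the global LRU path in vmscan.c or swap.c. */
static bool global_lru_add(struct zone *z, struct page_cgroup *pc)
{
    bool linked;

    zone_lru_lock(z);
    linked = memcg_add_lru(z, pc);
    zone_lru_unlock(z);
    return linked;
}
```

Calling global_lru_add() on an uncharged page_cgroup links nothing, because the used flag is still clear. After charge() the same call links it to the owning group's list exactly once.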
      
      Pros.
        - Easier to maintain.
        - memcg can make use of the laziness of pagevec.
        - We don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
        - The LRU status of memcg is synchronized with the global LRU's.
        - The number of locks is reduced.
        - account_move() is much simpler.
      Cons.
        - May increase the cost of LRU rotation.
          (No impact if memcg is not configured.)
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 31 Oct, 2008 1 commit
  4. 10 Oct, 2008 1 commit
    • Don't allow splice() to files opened with O_APPEND · efc968d4
      Committed by Linus Torvalds
      This is debatable, but while we're debating it, let's disallow the
      combination of splice and an O_APPEND destination.
      
      It's not entirely clear what the semantics of O_APPEND should be, and
      POSIX apparently expects pwrite() to ignore O_APPEND, for example.  So
      we could make up any semantics we want, including the old ones.
      
      But Miklos convinced me that we should at least give it some thought,
      and that accepting writes at arbitrary offsets is wrong at least for
      IS_APPEND() files (which always have O_APPEND set, even if the reverse
      isn't true: you can obviously have O_APPEND set on a regular file).
      
      So disallow O_APPEND entirely for now.  I doubt anybody cares, and this
      way we have one less gray area to worry about.
      Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu>
      Acked-by: Jens Axboe <ens.axboe@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
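
The user-visible effect of this change can be probed from userspace. The following is a minimal sketch, assuming a Linux system with glibc and a writable /tmp; the helper name splice_to_append_errno() is made up for illustration. On kernels that carry this restriction the call is expected to fail with EINVAL, but the exact result depends on the kernel the program runs on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Try to splice two bytes from a pipe into a temporary file that has
 * O_APPEND set. Returns 0 if splice() succeeded, the errno it reported
 * if it failed, or -1 if the setup itself failed.
 */
static int splice_to_append_errno(void)
{
    char path[] = "/tmp/splice-append-XXXXXX";
    int pipefd[2];
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path); /* the file stays usable until fd is closed */

    if (pipe(pipefd) == 0) {
        int flags = fcntl(fd, F_GETFL);

        if (flags >= 0 &&
            fcntl(fd, F_SETFL, flags | O_APPEND) == 0 &&
            write(pipefd[1], "hi", 2) == 2) {
            /* NULL offsets: use the current position of each descriptor */
            ssize_t n = splice(pipefd[0], NULL, fd, NULL, 2, 0);

            result = (n < 0) ? errno : 0;
        }
        close(pipefd[0]);
        close(pipefd[1]);
    }
    close(fd);
    return result;
}
```

A caller that needs append semantics can fall back to read() and write() when this helper reports EINVAL.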
  5. 05 Aug, 2008 1 commit
  6. 27 Jul, 2008 2 commits
  7. 04 Jul, 2008 1 commit
  8. 28 May, 2008 2 commits
  9. 08 May, 2008 1 commit
  10. 07 May, 2008 1 commit
  11. 29 Apr, 2008 1 commit
  12. 10 Apr, 2008 1 commit
  13. 04 Apr, 2008 1 commit
    • splice: use mapping_gfp_mask · 4cd13504
      Committed by Hugh Dickins
      The loop block driver is careful to mask __GFP_IO|__GFP_FS out of its
      mapping_gfp_mask, to avoid hangs under memory pressure.  But nowadays
      it uses splice, usually going through __generic_file_splice_read.  That
      must use mapping_gfp_mask instead of GFP_KERNEL to avoid those hangs.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
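
The idea behind this fix can be modeled in a few lines. This is a sketch under stated assumptions: the DEMO_GFP_* values are illustrative and do not match the kernel's real flag bits, and struct demo_mapping with its helpers is a hypothetical stand-in for the address_space and its mapping_gfp_mask() accessors.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits only; the real kernel constants differ. */
#define DEMO_GFP_WAIT   0x1u
#define DEMO_GFP_IO     0x2u
#define DEMO_GFP_FS     0x4u
#define DEMO_GFP_KERNEL (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)

/* Hypothetical stand-in for an address_space with a per-mapping mask. */
struct demo_mapping { unsigned int gfp_mask; };

static unsigned int demo_mapping_gfp_mask(const struct demo_mapping *m)
{
    assert(m != NULL);
    return m->gfp_mask;
}

/* What the loop driver does: forbid allocations that start I/O or re-enter the FS. */
static void demo_loop_restrict(struct demo_mapping *m)
{
    m->gfp_mask = demo_mapping_gfp_mask(m) & ~(DEMO_GFP_IO | DEMO_GFP_FS);
}

/* Flags the splice read path should allocate page cache pages with. */
static unsigned int demo_splice_read_gfp(const struct demo_mapping *m)
{
    /* Honor the per-mapping restriction instead of hard-coding DEMO_GFP_KERNEL. */
    return demo_mapping_gfp_mask(m);
}
```

With a hard-coded DEMO_GFP_KERNEL the restriction set by demo_loop_restrict() would be silently ignored, which is the situation the commit message describes.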
  14. 04 Mar, 2008 1 commit
  15. 11 Feb, 2008 1 commit
  16. 09 Feb, 2008 1 commit
  17. 01 Feb, 2008 1 commit
  18. 30 Jan, 2008 1 commit
  19. 29 Jan, 2008 1 commit
  20. 25 Jan, 2008 1 commit
  21. 17 Oct, 2007 3 commits
    • Implement file posix capabilities · b5376771
      Committed by Serge E. Hallyn
      Implement file posix capabilities.  This allows programs to be given a
      subset of root's powers regardless of who runs them, without having to use
      setuid and giving the binary all of root's powers.
      
      This version works with Kaigai Kohei's userspace tools, found at
      http://www.kaigai.gr.jp/index.php.  For more information on how to use this
      patch, Chris Friedhoff has posted a nice page at
      http://www.friedhoff.org/fscaps.html.
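
How a binary ends up with only a subset of root's powers can be sketched with plain bit masks. The transformation below follows the rules for exec as documented in capabilities(7) for kernels of this era; it is an illustration, not the patch's code. The type names, the function caps_after_exec() and the DEMO_ constants are hypothetical (the two bit positions mirror CAP_NET_BIND_SERVICE and CAP_SYS_ADMIN).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cap_mask; /* one bit per capability */

#define DEMO_CAP_NET_BIND_SERVICE ((cap_mask)1 << 10)
#define DEMO_CAP_SYS_ADMIN        ((cap_mask)1 << 21)

/* Capability sets stored with the file (in an extended attribute). */
struct file_caps { cap_mask permitted, inheritable; bool effective; };

/* Capability sets of a process. */
struct proc_caps { cap_mask permitted, inheritable, effective; };

/* Compute the process sets after exec of a file carrying `f`. */
static struct proc_caps caps_after_exec(struct proc_caps old,
                                        struct file_caps f,
                                        cap_mask bounding_set)
{
    struct proc_caps next;

    next.inheritable = old.inheritable;
    next.permitted = (old.inheritable & f.inheritable) |
                     (f.permitted & bounding_set);
    /* The file's effective flag is all-or-nothing. */
    next.effective = f.effective ? next.permitted : 0;
    assert((next.effective & ~next.permitted) == 0);
    return next;
}
```

An unprivileged process that executes a file whose permitted set holds only DEMO_CAP_NET_BIND_SERVICE gains that one capability and nothing else, which is the contrast with setuid root that the description draws.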
      
      Changelog:
      	Nov 27:
      	Incorporate fixes from Andrew Morton
      	(security-introduce-file-caps-tweaks and
      	security-introduce-file-caps-warning-fix)
      	Fix Kconfig dependency.
      	Fix change signaling behavior when file caps are not compiled in.
      
      	Nov 13:
      	Integrate comments from Alexey: Remove CONFIG_ ifdef from
      	capability.h, and use %zd for printing a size_t.
      
      	Nov 13:
      	Fix endianness warnings by sparse as suggested by Alexey
      	Dobriyan.
      
      	Nov 09:
      	Address warnings of unused variables at cap_bprm_set_security
      	when file capabilities are disabled, and simultaneously clean
      	up the code a little, by pulling the new code into a helper
      	function.
      
      	Nov 08:
      	For pointers to required userspace tools and how to use
      	them, see http://www.friedhoff.org/fscaps.html.
      
      	Nov 07:
      	Fix the calculation of the highest bit checked in
      	check_cap_sanity().
      
      	Nov 07:
      	Allow file caps to be enabled without CONFIG_SECURITY, since
      	capabilities are the default.
      	Hook cap_task_setscheduler when !CONFIG_SECURITY.
      	Move capable(TASK_KILL) to end of cap_task_kill to reduce
      	audit messages.
      
      	Nov 05:
      	Add secondary calls in selinux/hooks.c to task_setioprio and
      	task_setscheduler so that selinux and capabilities with file
      	cap support can be stacked.
      
      	Sep 05:
      	As Seth Arnold points out, uid checks are out of place
      	for capability code.
      
      	Sep 01:
      	Define task_setscheduler, task_setioprio, cap_task_kill, and
      	task_setnice to make sure a user cannot affect a process in which
      	they called a program with some fscaps.
      
      	One remaining question is the note under task_setscheduler: are we
      	ok with CAP_SYS_NICE being sufficient to confine a process to a
      	cpuset?
      
      	It is a semantic change, as without fscaps, attach_task doesn't
      	allow CAP_SYS_NICE to override the uid equivalence check.  But since
      	it uses security_task_setscheduler, which elsewhere is used where
      	CAP_SYS_NICE can be used to override the uid equivalence check,
      	fixing it might be tough.
      
      	     task_setscheduler
      		 note: this also controls cpuset:attach_task.  Are we ok with
      		     CAP_SYS_NICE being used to confine to a cpuset?
      	     task_setioprio
      	     task_setnice
      		 sys_setpriority uses this (through set_one_prio) for another
      		 process.  Need same checks as setrlimit
      
      	Aug 21:
      	Updated secureexec implementation to reflect the fact that
      	euid and uid might be the same and nonzero, but the process
      	might still have elevated caps.
      
      	Aug 15:
      	Handle endianness of xattrs.
      	Enforce capability version match between kernel and disk.
      	Enforce that no bits beyond the known max capability are
      	set, else return -EPERM.
      	With this extra processing, it may be worth reconsidering
      	doing all the work at bprm_set_security rather than
      	d_instantiate.
      
      	Aug 10:
      	Always call getxattr at bprm_set_security, rather than
      	caching it at d_instantiate.
      
      [morgan@kernel.org: file-caps clean up for linux/capability.h]
      [bunk@kernel.org: unexport cap_inode_killpriv]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: introduce write_begin, write_end, and perform_write aops · afddba49
      Committed by Nick Piggin
      These are intended to replace prepare_write and commit_write with more
      flexible alternatives that are also able to avoid the buffered write
      deadlock problems efficiently (which prepare_write is unable to do).
      
      [mark.fasheh@oracle.com: API design contributions, code review and fixes]
      [akpm@linux-foundation.org: various fixes]
      [dmonakhov@sw.ru: new aop block_write_begin fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: combine file_ra_state.prev_index/prev_offset into prev_pos · f4e6b498
      Committed by Fengguang Wu
      Combine the file_ra_state members
      				unsigned long prev_index
      				unsigned int prev_offset
      into
      				loff_t prev_pos
      
      It is more consistent and better supports huge files.
      
      Thanks to Peter for the nice proposal!
      
      [akpm@linux-foundation.org: fix shift overflow]
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
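
Folding a page index and an in-page offset into one byte position, and recovering them again, might look like the sketch below. It assumes 4 KiB pages, and the DEMO_ names, demo_loff_t and the three helpers are invented for illustration. The cast before the shift reflects the "fix shift overflow" note: the index has to be widened to 64 bits before shifting, or a 32-bit unsigned long would lose its high bits.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumes 4 KiB pages */
#define DEMO_PAGE_SIZE  ((demo_loff_t)1 << DEMO_PAGE_SHIFT)

typedef int64_t demo_loff_t; /* stand-in for the kernel's loff_t */

/* Combine the two old members into a single byte position. */
static demo_loff_t make_prev_pos(unsigned long index, unsigned int offset)
{
    assert(offset < DEMO_PAGE_SIZE);
    /* Widen first, then shift, so large indexes do not overflow. */
    return ((demo_loff_t)index << DEMO_PAGE_SHIFT) | offset;
}

/* Recover the page index from the combined position. */
static unsigned long prev_index(demo_loff_t pos)
{
    return (unsigned long)(pos >> DEMO_PAGE_SHIFT);
}

/* Recover the offset within the page. */
static unsigned int prev_offset(demo_loff_t pos)
{
    return (unsigned int)(pos & (DEMO_PAGE_SIZE - 1));
}
```

A page index of 0x200000 corresponds to a position of 8 GiB, which no longer fits in 32 bits. That is the huge-file case the single loff_t member handles.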
  22. 16 Oct, 2007 1 commit
    • splice: fix double kunmap() in vmsplice copy path · 6866bef4
      Committed by Jens Axboe
      The out label should not include the unmap, the only way to jump
      there already has unmapped the source.
      
      ------------[ cut here ]------------
      kernel BUG at mm/highmem.c:206!
      invalid opcode: 0000 [#1]
      SMP
      Modules linked in: netconsole autofs4 hidp nfs lockd nfs_acl rfcomm l2cap
      bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core
      ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath
      dm_mod video output sbs battery ac parport_pc lp parport sg i2c_piix4
      i2c_core floppy cfi_probe gen_probe scb2_flash mtd chipreg tg3 e1000 button
      ide_cd serio_raw cdrom aic7xxx scsi_transport_spi sd_mod scsi_mod ext3 jbd
      ehci_hcd ohci_hcd uhci_hcd
      CPU:    3
      EIP:    0060:[<c045bbc3>]    Not tainted VLI
      EFLAGS: 00010246   (2.6.23 #1)
      EIP is at kunmap_high+0x51/0x8e
      00002000
             f7c21a00 00000000 00000000 c0489036 00018e32 00000002 00000000
      00001000
      Call Trace:
       [<c0487dd9>] pipe_to_user+0xca/0xd3
       [<c0488233>] __splice_from_pipe+0x53/0x1bd
       [<c0454947>] filemap_fault+0x221/0x380
       [<c0487d0f>] pipe_to_user+0x0/0xd3
       [<c0489036>] sys_vmsplice+0x3b7/0x422
       [<c045ec3f>] handle_mm_fault+0x4d5/0x8eb
       [<c041ed5b>] kmap_atomic+0x1c/0x20
       [<c045d33d>] unmap_vmas+0x3d1/0x584
       [<c045f717>] free_pgtables+0x90/0xa0
       [<c041d84b>] pgd_dtor+0x0/0x1
       [<c044d665>] audit_syscall_exit+0x2aa/0x2c6
       [<c0407817>] do_syscall_trace+0x124/0x169
       [<c0404df2>] syscall_call+0x7/0xb
       =======================
      Code: 2d 00 d0 5b 00 25 00 00 e0 ff 29
      c2 89 d0 c1 e8 0c 8b 14 85 a0 6c 7c c0 4a 85 d2 89 14 85 a0 6c 7c c0 74 07
      31 c9 4a 75 15 eb 04 <0f> 0b eb fe 31 c9 81 3d 78 38 6d c0 78 38 6d c0 0f
      95 c1 b0 01
      EIP: [<c045bbc3>] kunmap_high+0x51/0x8e SS:ESP 0068:f5960df0
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
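
The bug class described here, a shared exit label that repeats cleanup already done on every path reaching it, can be shown with a userspace model. This is a sketch only: demo_kmap(), demo_kunmap(), demo_copy() and demo_pipe_to_user() are hypothetical stand-ins, and the real function's control flow is more involved. The counter plays the role of the mapping bookkeeping that triggered the BUG in the trace above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

static int demo_map_count; /* outstanding mappings; must never go negative */

static char *demo_kmap(char *page)
{
    demo_map_count++;
    return page;
}

static void demo_kunmap(void)
{
    assert(demo_map_count > 0); /* a double unmap would trip this */
    demo_map_count--;
}

/* Stands in for a copy to user space that may fault. */
static bool demo_copy(char *dst, const char *src, size_t len, bool fault)
{
    if (fault)
        return false;
    memcpy(dst, src, len);
    return true;
}

static int demo_pipe_to_user(char *dst, char *page, size_t len,
                             bool fault_fast, bool fault_slow)
{
    char *src;
    bool ok;
    int ret = 0;

    /* Fast path: map, try the copy, unmap. */
    src = demo_kmap(page);
    ok = demo_copy(dst, src, len, fault_fast);
    demo_kunmap();
    if (ok)
        goto out;

    /* Slow path: map again and retry. */
    src = demo_kmap(page);
    if (!demo_copy(dst, src, len, fault_slow))
        ret = -EFAULT;
    demo_kunmap(); /* unmap before the label, not after it */
out:
    /* No unmap here: every path that reaches this label has unmapped already. */
    return ret;
}
```

If the second demo_kunmap() call sat below the out label, the fast path would unmap twice and the assertion in demo_kunmap() would fire.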
  23. 02 Oct, 2007 1 commit
  24. 27 Jul, 2007 1 commit
  25. 21 Jul, 2007 1 commit
  26. 20 Jul, 2007 4 commits
  27. 16 Jul, 2007 1 commit
    • splice: direct splicing updates ppos twice · bcd4f3ac
      Committed by Jens Axboe
      OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> reported that he's noticed
      nfsd read corruption in recent kernels, and did the hard work of
      discovering that it's due to splice updating the file position twice.
      This means that the next operation would start further ahead than it
      should.
      
      nfsd_vfs_read()
          splice_direct_to_actor()
              while(len) {
                  do_splice_to()                     [update sd->pos]
                      -> generic_file_splice_read()  [read from sd->pos]
                  nfsd_direct_splice_actor()
                      -> __splice_from_pipe()        [update sd->pos]
      
      There's nothing wrong with the core splice code, but the direct
      splicing is an addon that calls both input and output paths.
      So it has to take care in locally caching offset so it remains correct.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
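
The local caching of the offset can be sketched as follows. This is an illustrative model rather than the kernel code: struct demo_splice_desc, demo_splice_to(), demo_actor() and demo_direct_splice() are hypothetical stand-ins for the descriptor, the input path, the output actor and the direct-splice loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the splice descriptor. */
struct demo_splice_desc {
    long long pos;    /* file position of the next read */
    size_t total_len; /* bytes still to transfer */
};

/* Input side: "reads" len bytes at *ppos and advances *ppos. */
static size_t demo_splice_to(long long *ppos, size_t len)
{
    *ppos += (long long)len;
    return len;
}

/* Output side: consumes len bytes and, as a side effect, advances sd->pos. */
static size_t demo_actor(struct demo_splice_desc *sd, size_t len)
{
    sd->pos += (long long)len;
    return len;
}

/* Direct splice loop: drives both sides, so it must own the offset. */
static long long demo_direct_splice(struct demo_splice_desc *sd, size_t chunk)
{
    assert(chunk > 0);
    while (sd->total_len) {
        size_t len = sd->total_len < chunk ? sd->total_len : chunk;
        long long pos = sd->pos; /* local cache of the file offset */
        size_t got = demo_splice_to(&pos, len);

        demo_actor(sd, got);
        sd->pos = pos; /* overwrite: the offset advances once per chunk */
        sd->total_len -= got;
    }
    return sd->pos;
}
```

Without the local copy, both stages would advance sd->pos, and a 10000-byte transfer would leave the position at 20000, so the next read would start too far ahead.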
  28. 13 Jul, 2007 2 commits
  29. 10 Jul, 2007 4 commits