1. 09 May, 2007 38 commits
    • Protect tty drivers list with tty_mutex · ca509f69
      Alexey Dobriyan committed
      Additions to and removals from the tty_drivers list were done without
      any locking, as was iterating over it to generate /proc/tty/drivers.
      
      testing: modprobe/rmmod loop of simple module which does nothing but
      tty_register_driver() vs cat /proc/tty/drivers loop
      
      BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
       printing eip:
      c01cefa7
      *pde = 00000000
      Oops: 0000 [#1]
      PREEMPT
      last sysfs file: devices/pci0000:00/0000:00:1d.7/usb5/5-0:1.0/bInterfaceProtocol
      Modules linked in: ohci_hcd af_packet e1000 ehci_hcd uhci_hcd usbcore xfs
      CPU:    0
      EIP:    0060:[<c01cefa7>]    Not tainted VLI
      EFLAGS: 00010297   (2.6.21-rc4-mm1 #4)
      EIP is at vsnprintf+0x3a4/0x5fc
      eax: 6b6b6b6b   ebx: f6cb50f2   ecx: 6b6b6b6b   edx: fffffffe
      esi: c0354700   edi: f6cb6000   ebp: 6b6b6b6b   esp: f31f5e68
      ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
      Process cat (pid: 31864, ti=f31f4000 task=c1998030 task.ti=f31f4000)
      Stack: 00000000 c0103f20 c013003a c0103f20 00000000 f6cb50da 0000000a 00000f0e
             f6cb50f2 00000010 00000014 ffffffff ffffffff 00000007 c0354753 f6cb50f2
             f73e39dc f73e39dc 00000001 c0175416 f31f5ed8 f31f5ed4 0ee00000 f32090bc
      Call Trace:
       [<c0103f20>] restore_nocheck+0x12/0x15
       [<c013003a>] mark_held_locks+0x6d/0x86
       [<c0103f20>] restore_nocheck+0x12/0x15
       [<c0175416>] seq_printf+0x2e/0x52
       [<c0192895>] show_tty_range+0x35/0x1f3
       [<c0175416>] seq_printf+0x2e/0x52
       [<c0192add>] show_tty_driver+0x8a/0x1d9
       [<c01758f6>] seq_read+0x70/0x2ba
       [<c0175886>] seq_read+0x0/0x2ba
       [<c018d8e6>] proc_reg_read+0x63/0x9f
       [<c015e764>] vfs_read+0x7d/0xb5
       [<c018d883>] proc_reg_read+0x0/0x9f
       [<c015eab1>] sys_read+0x41/0x6a
       [<c0103e4e>] sysenter_past_esp+0x5f/0x99
       =======================
      Code: 00 8b 4d 04 e9 44 ff ff ff 8d 4d 04 89 4c 24 50 8b 6d 00 81 fd ff 0f 00 00 b8 a4 c1 35 c0 0f 46 e8 8b 54 24 2c 89 e9 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 89 c6 8b 44 24 28 89
      EIP: [<c01cefa7>] vsnprintf+0x3a4/0x5fc SS:ESP 0068:f31f5e68
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove do_sync_file_range() · ef51c976
      Mark Fasheh committed
      Remove do_sync_file_range() and convert callers to just use
      do_sync_mapping_range().
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: proc support requires PROC_FS · 880ebdc5
      Randy Dunlap committed
      REISER_FS /proc option needs to depend on PROC_FS.
      
      fs/reiserfs/procfs.c: In function 'show_super':
      fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'max_hash_collisions'
      fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'breads'
      fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'bread_miss'
      fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key'
      fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_fs_changed'
      fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_restarted'
      fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'insert_item_restarted'
      fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'paste_into_item_restarted'
      fs/reiserfs/procfs.c:138: error: 'reiserfs_proc_info_data_t' has no member named 'cut_from_item_restarted'
      fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_solid_item_restarted'
      fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_item_restarted'
      fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaked_oid'
      fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaves_removable'
      fs/reiserfs/procfs.c: In function 'show_per_level':
      fs/reiserfs/procfs.c:184: error: 'reiserfs_proc_info_data_t' has no member named 'balance_at'
      fs/reiserfs/procfs.c:185: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_read_at'
      fs/reiserfs/procfs.c:186: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_fs_changed'
      fs/reiserfs/procfs.c:187: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_restarted'
      fs/reiserfs/procfs.c:188: error: 'reiserfs_proc_info_data_t' has no member named 'free_at'
      fs/reiserfs/procfs.c:189: error: 'reiserfs_proc_info_data_t' has no member named 'items_at'
      fs/reiserfs/procfs.c:190: error: 'reiserfs_proc_info_data_t' has no member named 'can_node_be_removed'
      fs/reiserfs/procfs.c:191: error: 'reiserfs_proc_info_data_t' has no member named 'lnum'
      fs/reiserfs/procfs.c:192: error: 'reiserfs_proc_info_data_t' has no member named 'rnum'
      fs/reiserfs/procfs.c:193: error: 'reiserfs_proc_info_data_t' has no member named 'lbytes'
      fs/reiserfs/procfs.c:194: error: 'reiserfs_proc_info_data_t' has no member named 'rbytes'
      fs/reiserfs/procfs.c:195: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors'
      fs/reiserfs/procfs.c:196: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors_restart'
      fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_l_neighbor'
      fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_r_neighbor'
      fs/reiserfs/procfs.c: In function 'show_bitmap':
      fs/reiserfs/procfs.c:224: error: 'reiserfs_proc_info_data_t' has no member named 'free_block'
      fs/reiserfs/procfs.c:225: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
      fs/reiserfs/procfs.c:226: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
      fs/reiserfs/procfs.c:227: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
      fs/reiserfs/procfs.c:228: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
      fs/reiserfs/procfs.c:229: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
      fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
      fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
      fs/reiserfs/procfs.c: In function 'show_journal':
      fs/reiserfs/procfs.c:384: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:385: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:386: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:387: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:388: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:389: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:390: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:391: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:392: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:393: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:394: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
      fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_init':
      fs/reiserfs/procfs.c:504: warning: implicit declaration of function '__PINFO'
      fs/reiserfs/procfs.c:504: error: request for member 'lock' in something not a structure or union
      fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_done':
      fs/reiserfs/procfs.c:544: error: request for member 'lock' in something not a structure or union
      fs/reiserfs/procfs.c:545: error: request for member 'exiting' in something not a structure or union
      fs/reiserfs/procfs.c:546: error: request for member 'lock' in something not a structure or union
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • /proc/*/oom_score oops re badness · 19c5d45a
      Alexey Dobriyan committed
      Eternal quest to make
      
      	while true; do cat /proc/fs/xfs/stat >/dev/null 2>/dev/null; done
      	while true; do find /proc -type f 2>/dev/null | xargs cat >/dev/null 2>/dev/null; done
      	while true; do modprobe xfs; rmmod xfs; done
      
      work reliably continues and now kernel oopses in the following way:
      
      BUG: unable to handle ... at virtual address 6b6b6b6b
      EIP is at badness
      process: cat
      	proc_oom_score
      	proc_info_read
      	sys_fstat64
      	vfs_read
      	proc_info_read
      	sys_read
      
      Failing code is prefetch hidden in list_for_each_entry() in badness().
      badness() is reachable from two points. One is proc_oom_score, another
      is out_of_memory() => select_bad_process() => badness().
      
      Second path grabs tasklist_lock, while first doesn't.
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • VFS: delay the dentry name generation on sockets and pipes · c23fbb6b
      Eric Dumazet committed
      1) Introduces a new method in 'struct dentry_operations'.  This method
         called d_dname() might be called from d_path() to build a pathname for
         special filesystems.  It is called without locks.
      
         Future patches (if we succeed in having one common dentry for all
         pipes/sockets) may need to change prototype of this method, but we now
         use : char *d_dname(struct dentry *dentry, char *buffer, int buflen);
      
      2) Adds a dynamic_dname() helper function that eases d_dname() implementations
      
      3) Defines d_dname method for sockets : No more sprintf() at socket
         creation.  This is delayed up to the moment someone does an access to
         /proc/pid/fd/...
      
      4) Defines d_dname method for pipes : No more sprintf() at pipe
         creation.  This is delayed up to the moment someone does an access to
         /proc/pid/fd/...
      
      A benchmark consisting of 1,000,000 calls to pipe()/close()/close() gives a
      *nice* speedup on my Pentium(M) 1.6 GHz:
      
      3.090 s instead of 3.450 s
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Christoph Hellwig <hch@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
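The idea of deferring name generation can be illustrated with a small userspace sketch. It is not the kernel's d_dname() or dynamic_dname() code: the function name `lazy_pipe_name` and the `pipe:[inode]` format string are assumptions made for illustration. Nothing is formatted at creation time; the name is built only when asked for, at the tail of a caller-supplied buffer.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace sketch of lazy name generation. The name is
 * built only on request, placed at the END of the caller's buffer, and
 * a pointer to its first character is returned (NULL if it cannot fit).
 */
static char *lazy_pipe_name(unsigned long ino, char *buffer, int buflen)
{
    char tmp[64];
    int len;

    /* Format into a scratch buffer first; len excludes the trailing NUL. */
    len = snprintf(tmp, sizeof(tmp), "pipe:[%lu]", ino);
    if (len < 0 || len >= (int)sizeof(tmp) || len + 1 > buflen)
        return NULL;

    /* Copy the name plus its NUL terminator to the tail of the buffer. */
    buffer += buflen - (len + 1);
    memcpy(buffer, tmp, (size_t)len + 1);
    return buffer;
}
```

The cost of formatting is paid only by the rare reader of the name, not by every pipe() or socket() call.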
    • add file position info to proc · 27932742
      Miklos Szeredi committed
      Add support for finding out the current file position, open flags and
      possibly other info in the future.
      
      These new entries are added:
      
        /proc/PID/fdinfo/FD
        /proc/PID/task/TID/fdinfo/FD
      
      For each fd the information is provided in the following format:
      
      pos:	1234
      flags:	0100002
      
      [bunk@stusta.de: make struct proc_fdinfo_file_operations static]
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
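As a rough illustration of the fdinfo output format described in this commit, the sketch below renders the two fields with snprintf(). The helper name `format_fdinfo` is hypothetical and this is not the kernel's implementation; it only assumes that flags are shown in octal with a leading zero, as in the sample output.

```c
#include <stdio.h>

/* Hypothetical helper: render "pos" and "flags" in the format shown above. */
static int format_fdinfo(char *buf, size_t size, long long pos,
                         unsigned int flags)
{
    /* flags are printed in octal with a leading 0, e.g. 0100002 */
    return snprintf(buf, size, "pos:\t%lld\nflags:\t0%o\n", pos, flags);
}
```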
    • procfs: reorder struct pid_dentry to save space on 64bit archs, and constify them · c5141e6d
      Eric Dumazet committed
      Change the order of fields of struct pid_entry (file fs/proc/base.c) in order
      to avoid a hole on 64bit archs.  (8 bytes saved per object)
      
      Also change all pid_entry arrays to be const qualified, to make clear they
      must not be modified.
      
      Before (on x86_64) :
      
      # size fs/proc/base.o
         text    data     bss     dec     hex filename
        15549    2192       0   17741    454d fs/proc/base.o
      
      After :
      
      # size fs/proc/base.o
         text    data     bss     dec     hex filename
        17229     176       0   17405    43fd fs/proc/base.o
      
      That's 336 bytes saved on kernel size on x86_64
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
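The effect of field ordering on structure size can be sketched with two made-up layouts. These are not the real struct pid_entry fields; the names `entry_before`, `entry_after` and `bytes_saved` are assumptions for illustration. On a typical LP64 target the first layout has padding after each 4-byte field, while the second packs both 4-byte fields into one 8-byte slot.

```c
#include <stddef.h>

/* Hypothetical layouts, not the real struct pid_entry. */
struct entry_before {
    int len;            /* 4 bytes, then padding before the pointer on LP64 */
    const char *name;
    unsigned int mode;  /* 4 bytes, then padding before the next pointer */
    const void *ops;
};

struct entry_after {
    const char *name;
    int len;            /* the two 4-byte fields now share one 8-byte slot */
    unsigned int mode;
    const void *ops;
};

/* Bytes saved per object by the reordering (0 on typical 32-bit targets). */
static size_t bytes_saved(void)
{
    return sizeof(struct entry_before) - sizeof(struct entry_after);
}
```

Declaring arrays of such entries `const` additionally lets the toolchain place them in read-only data, which is consistent with the shift from data to text in the size output above.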
    • proc: maps protection · 5096add8
      Kees Cook committed
      The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
      information about the memory location and usage of processes.  Issues:
      
      - maps should not be world-readable, especially if programs expect any
        kind of ASLR protection from local attackers.
      - maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
        check the maps when %n is in a *printf call, and a setuid(getuid())
        process wouldn't be able to read its own maps file.  (For reference
        see http://lkml.org/lkml/2006/1/22/150)
      - a system-wide toggle is needed to allow prior behavior in the case of
        non-root applications that depend on access to the maps contents.
      
      This change implements a check using "ptrace_may_attach" before allowing
      access to read the maps contents.  To control this protection, the new knob
      /proc/sys/kernel/maps_protect has been added, with corresponding updates to
      the procfs documentation.
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: New sysctl numbers are old hat]
      Signed-off-by: Kees Cook <kees@outflux.net>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • namei.c: remove utterly outdated comment · 5843205b
      Christoph Hellwig committed
      We don't have a routine called namei() anymore since at least 2.3.x, and
      the comment is just totally out of sync with the current lookup logic.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: remove superflous sb == NULL checks · acb0c854
      Christoph Hellwig committed
      inode->i_sb is always set, so there is no need to check for it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: remove pathetic ->deleted WARN_ON · 578c8183
      Alexey Dobriyan committed
      WARN_ON(de && de->deleted); is sooo unreliable. Why?
      
      proc_lookup				remove_proc_entry
      ===========				=================
      lock_kernel();
      spin_lock(&proc_subdir_lock);
      [find proc entry]
      spin_unlock(&proc_subdir_lock);
      					spin_lock(&proc_subdir_lock);
      					[find proc entry]
      
      proc_get_inode
      ==============
      WARN_ON(de && de->deleted);			...
      
      					if (!atomic_read(&de->count))
      						free_proc_entry(de);
      					else
      						de->deleted = 1;
      
      So if you hit some strange oops [1] and don't see this WARN_ON, it means
      nothing.
      
      [1] try_module_get() of module which doesn't exist, two lines below
          should suffice, or not?
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix race between proc_readdir and remove_proc_entry · 59cd0cbc
      Darrick J. Wong committed
      Fix the following race:
      
      proc_readdir				remove_proc_entry
      ============				=================
      
      spin_lock(&proc_subdir_lock);
      [choose PDE to start filldir from]
      spin_unlock(&proc_subdir_lock);
      					spin_lock(&proc_subdir_lock);
      					[find PDE]
      					[free PDE, refcount is 0]
      					spin_unlock(&proc_subdir_lock);
      		    /* boom */
      if (filldir(dirent, de->name, ...
      
      [de_put on error path --adobriyan]
      Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix race between proc_get_inode() and remove_proc_entry() · 7695650a
      Alexey Dobriyan committed
      proc_lookup				remove_proc_entry
      ===========				=================
      
      lock_kernel();
      spin_lock(&proc_subdir_lock);
      [find PDE with refcount 0]
      spin_unlock(&proc_subdir_lock);
      					spin_lock(&proc_subdir_lock);
      					[find PDE with refcount 0]
      					[check refcount and free PDE]
      					spin_unlock(&proc_subdir_lock);
      proc_get_inode:
      	de_get(de); /* boom */
      Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • add filesystem subtype support · 79c0b2df
      Miklos Szeredi committed
      There's a slight problem with filesystem type representation in fuse
      based filesystems.
      
      From the kernel's view, there are just two filesystem types: fuse and
      fuseblk.  From the user's view there are lots of different filesystem
      types.  The user is not even much concerned if the filesystem is fuse based
      or not.  So there's a conflict of interest in how this should be
      represented in fstab, mtab and /proc/mounts.
      
      The current scheme is to encode the real filesystem type in the mount
      source.  So an sshfs mount looks like this:
      
        sshfs#user@server:/   /mnt/server    fuse   rw,nosuid,nodev,...
      
      This url-ish syntax works OK for sshfs and similar filesystems.  However
      for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
      the kernel expects the mount source to be a real device name.
      
      A possibly better scheme would be to encode the real type in the type
      field as "type.subtype".  So fuse mounts would look like this:
      
        /dev/hda1       /mnt/windows   fuseblk.ntfs-3g   rw,...
        user@server:/   /mnt/server    fuse.sshfs        rw,nosuid,nodev,...
      
      This patch adds the necessary code to the kernel so that this can be
      correctly displayed in /proc/mounts.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
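The "type.subtype" convention described in this commit can be illustrated by a small parser. This is a userspace sketch under the assumption that the first '.' separates the two parts; the helper name `fs_subtype` is hypothetical and does not come from the patch.

```c
#include <string.h>

/*
 * Hypothetical helper: return the subtype part of a "type.subtype"
 * string, or NULL when there is no subtype. *type_len receives the
 * length of the main type.
 */
static const char *fs_subtype(const char *fstype, size_t *type_len)
{
    const char *dot = strchr(fstype, '.');

    if (dot == NULL) {
        *type_len = strlen(fstype);
        return NULL;
    }
    *type_len = (size_t)(dot - fstype);
    return dot + 1;
}
```

With this split, "fuseblk.ntfs-3g" yields the main type "fuseblk" (the part the kernel would match on) and the subtype "ntfs-3g" (the part the user cares about).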
    • epoll: optimizations and cleanups · 6192bd53
      Davide Libenzi committed
      Epoll is doing multiple passes over the ready set at the moment, because of
      the constraints over the f_op->poll() call.  Looking at the code again, I
      noticed that we already hold the epoll semaphore in read, and this
      (together with other locking conditions that hold while doing an
      epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
      (in a single pass).
      
      This is a stress application that can be used to test the new code.  It
      spawns multiple threads and calls epoll_wait() and epoll_ctl() from many
      threads.  Stress tested on my dual Opteron 254 without any problems.
      
      http://www.xmailserver.org/totalmess.c
      
      This is not a benchmark, just something that tries to stress and exploit
      possible problems with the new code.
      Also, I made a stupid micro-benchmark:
      
      http://www.xmailserver.org/epwbench.c
      
      [1] Considering that epoll must be thread-safe, there are five ways we can
          be hit during an epoll_wait() transfer loop (ep_send_events()):
      
          1) The epoll fd going away and calling ep_free
             This just can't happen, since we did an fget() in sys_epoll_wait
      
          2) An epoll_ctl(EPOLL_CTL_DEL)
             This can't happen because epoll_ctl() gets ep->sem in write, and
             we're holding it in read during ep_send_events()
      
          3) An fd stored inside the epoll fd going away
             This can't happen because in eventpoll_release_file() we get
             ep->sem in write, and we're holding it in read during
             ep_send_events()
      
          4) Another epoll_wait() happening on another thread
             They both can be inside ep_send_events() at the same time, we get
             (splice) the ready-list under the spinlock, so each one will get
             its own ready list. Note that an fd cannot be at the same time
             inside more than one ready list, because ep_poll_callback() will
             not re-queue it if it sees it already linked:
      
             if (ep_is_linked(&epi->rdllink))
                      goto is_linked;
      
             Another case that can happen, is two concurrent epoll_wait(),
             coming in with a userspace event buffer of size, say, ten.
             Suppose there are 50 event ready in the list. The first
             epoll_wait() will "steal" the whole list, while the second, seeing
             no events, will go to sleep. But at the end of ep_send_events() in
             the first epoll_wait(), we will re-inject surplus ready fds, and we
             will trigger the proper wake_up to the second epoll_wait().
      
          5) ep_poll_callback() hitting us asynchronously
             This is the tricky part. As I said above, the ep_is_linked() test
             done inside ep_poll_callback() guarantees that, for as long as the
             item is linked to a list, ep_poll_callback() will not try to
             re-queue it (that is, read or write data on any of its members). When
             we do a list_del() in ep_send_events(), the item will still satisfy
             the ep_is_linked() test (whatever data is written in prev/next,
             it'll never be its own pointer), so ep_poll_callback() will still
             leave us alone. It's only after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink)
             that it'll become visible to ep_poll_callback(), but at that point
             we're already past it.
      
      [akpm@osdl.org: 80 cols]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
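Point 5 in the footnote relies on a property of circular doubly linked lists that a minimal userspace sketch can show. This is not the kernel's list implementation: `struct node`, the `node_*` helpers and the `poison` marker are hypothetical stand-ins. A node counts as unlinked only when it points at itself, so a delete that leaves stale pointers behind still looks "linked" until the node is re-initialised.

```c
struct node { struct node *next, *prev; };

/* Arbitrary non-NULL marker standing in for stale pointers after a delete. */
static struct node poison;

static void node_init(struct node *n) { n->next = n; n->prev = n; }

/* "Linked" means the node does not point at itself. */
static int node_is_linked(const struct node *n) { return n->next != n; }

static void node_add(struct node *n, struct node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink from the list but leave stale pointers in the removed node. */
static void node_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = &poison;
    n->prev = &poison;
}
```

After node_del() the item is off the list yet node_is_linked() still reports true; only a later node_init() makes it eligible for queueing again, which mirrors the window the commit message describes.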
    • ext3: dirindex error pointer issues · fedee54d
      Dmitriy Monakhov committed
      - ext3_dx_find_entry() exits without setting a proper error pointer

      - do_split() exits without setting a proper error pointer;
        this is really painful because many callers contain the following code:

                de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
                if (!(de))
                             return retval;
                <<< WOW retval wasn't changed by do_split(), so caller failed
                <<< but return SUCCESS :)

      - Rearrange the do_split() error path. The current error path is really ugly; all
        this jumping up and down doesn't make the code easy to understand.
      
      [dmonakhov@sw.ru: fix annoying fake error messages]
      Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
      Cc: Andreas Dilger <adilger@clusterfs.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
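The calling convention at issue (return NULL on failure and report the reason through an out-parameter) can be sketched in userspace as follows. The names `split_sketch` and `caller_sketch` are hypothetical and the code is not taken from ext3; it only shows why the callee must write the error on every failure path.

```c
#include <errno.h>
#include <stddef.h>

struct entry { int id; };

/*
 * Hypothetical stand-in for a helper with the do_split() calling
 * convention: returns NULL on failure and must store the reason in *error.
 */
static struct entry *split_sketch(struct entry *slot, int simulate_failure,
                                  int *error)
{
    if (simulate_failure) {
        *error = -ENOMEM;   /* set on every failure path */
        return NULL;
    }
    *error = 0;
    return slot;
}

/* Caller pattern from the commit message: propagate the callee's code. */
static int caller_sketch(int simulate_failure)
{
    struct entry slot = { 1 };
    int retval = 0;
    struct entry *de = split_sketch(&slot, simulate_failure, &retval);

    if (!de)
        return retval;      /* would be 0 (success) if the callee forgot */
    return 0;
}
```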
    • Merge sys_clone()/sys_unshare() nsproxy and namespace handling · e3222c4e
      Badari Pulavarty committed
      sys_clone() and sys_unshare() both make copies of nsproxy and its associated
      namespaces, but they have different code paths.
      
      This patch merges all the nsproxy and its associated namespace copy/clone
      handling (as much as possible).  Posted on container list earlier for
      feedback.
      
      - Create a new nsproxy and its associated namespaces and pass it back to
        caller to attach it to right process.
      
      - Changed all copy_*_ns() routines to return a new copy of namespace
        instead of attaching it to task->nsproxy.
      
      - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.
      
      - Removed unnecessary !ns checks from copy_*_ns() and added BUG_ON()
        just in case.
      
      - Get rid of all individual unshare_*_ns() routines and make use of
        copy_*_ns() instead.
      
      [akpm@osdl.org: cleanups, warning fix]
      [clg@fr.ibm.com: remove dup_namespaces() declaration]
      [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
      [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <containers@lists.osdl.org>
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: fix remove_arg_zero · 4fc75ff4
      Nick Piggin committed
      Petr Tesarik discovered a problem in remove_arg_zero(). He writes:
      
       When a script is loaded, load_script() replaces argv[0] with the
       name of the interpreter and the filename passed to the exec syscall.
       However, there is no guarantee that the length of the interpreter
       name plus the length of the filename is greater than the length of
       the original argv[0]. If the difference happens to cross a page boundary,
       setup_arg_pages() will call put_dirty_page() [aka install_arg_page()]
       with an address outside the VMA.
      
       Therefore, remove_arg_zero() must free all pages which would be unused
       after the argument is removed.
      
      So, rewrite the remove_arg_zero function without gotos, with a few comments,
      and with the commonly used explicit index/offset. This fixes the problem
      and makes it easier to understand as well.
      
      [a.p.zijlstra@chello.nl: add comment]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: correct misspelled "REISERFS_PROC_INFO" to "CONFIG_REISERFS_PROC_INFO" · f87367a6
      Robert P. J. Day committed
      Correct the misspelling of the preprocessor check of a Kconfig option to refer
      to CONFIG_REISERFS_PROC_INFO and not just the incorrect REISERFS_PROC_INFO.
      Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: shrink superblock if no xattrs · fe08a9d4
      Alexey Dobriyan committed
      This makes in-core superblock fit into one cacheline here.
      
      Before:
          struct dentry *            xattr_root;           /*   124     4 */
          /* --- cacheline 1 boundary (128 bytes) --- */
          struct rw_semaphore        xattr_dir_sem;        /*   128    12 */
          int                        j_errno;              /*   140     4 */
          }; /* size: 144, cachelines: 2 */
             /* sum members: 142, holes: 1, sum holes: 2 */
             /* last cacheline: 16 bytes */
      
      After:
          int                        j_errno;              /*   124     4 */
          /* --- cacheline 1 boundary (128 bytes) --- */
          }; /* size: 128, cachelines: 1 */
             /* sum members: 126, holes: 1, sum holes: 2 */
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: <reiserfs-dev@namesys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • reiserfs: possible null pointer dereference during resize · 2d3466a3
      Dmitriy Monakhov committed
      sb_read may return NULL, so let's explicitly check for it.  If it does, free
      the new bitmap blocks array; after that we can safely exit, as is done above
      during bitmap allocation.
      Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • freevxfs: possible null pointer dereference fix · 82f703bb
      Dmitriy Monakhov committed
      sb_read may return NULL, so let's explicitly check it.
      Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • is_power_of_2 in fs/block_dev.c · 1368c4f2
      Vignesh Babu BM committed
      Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2
      Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
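The difference between the open-coded idiom and a named helper can be sketched as below. The function names `open_coded_check` and `is_pow2_sketch` are hypothetical; the sketch assumes the helper is meant to reject zero, which the bare bit trick does not do on its own.

```c
#include <stdbool.h>

/* The open-coded idiom: also true for n == 0, which is not a power of 2. */
static bool open_coded_check(unsigned long n)
{
    return (n & (n - 1)) == 0;
}

/* Sketch of a helper that excludes zero. */
static bool is_pow2_sketch(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```

A power of two has exactly one bit set, so clearing the lowest set bit with n & (n - 1) leaves zero; the named helper makes that intent readable at the call site.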
    • is_power_of_2 in fs/hfs · e1b5c1d3
      Vignesh Babu BM committed
      Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2
      Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      is_power_of_2 in fat · e7d709c0
      Vignesh Babu BM committed
      Replace open-coded (n & (n-1)) power-of-2 checks with is_power_of_2().
      Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
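      The check that the three is_power_of_2 patches above consolidate can be
      sketched in userspace.  The helper below follows the logic of the
      kernel's is_power_of_2(); open_coded_check() is a hypothetical name for
      the bare bit test, which, unlike the helper, also accepts 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in with the same logic as the kernel helper:
 * true only when exactly one bit of n is set. */
bool is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical name for the open-coded test; it is also true for n == 0. */
bool open_coded_check(unsigned long n)
{
    return (n & (n - 1)) == 0;
}

/* Small usage example comparing the two on a few inputs. */
void check_power_of_2_examples(void)
{
    assert(is_power_of_2(4096));
    assert(!is_power_of_2(12));
    assert(!is_power_of_2(0));
    assert(open_coded_check(0));    /* the bare test lets 0 through */
}
```

      The difference at n == 0 is the main reason a named helper is easier to
      review than each caller's own variant of the expression.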
    • F
      devpts: add fsnotify create event · 3972b7f6
      Florin Malita committed
      Currently, devpts doesn't generate an fsnotify event upon pts creation
      because the regular vfs paths aren't involved.  Deallocation, on the other
      hand, correctly generates a nameremove event thanks to the d_delete()
      invocation in devpts_pty_kill().
      
      This patch adds the missing fsnotify_create() trigger in devpts_pty_new().
      Signed-off-by: Florin Malita <fmalita@gmail.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      use SEEK_MAX to validate user lseek arguments · 1ae7075b
      Chris Snook committed
      Add SEEK_MAX and use it to validate lseek arguments from userspace.
      Signed-off-by: Chris Snook <csnook@redhat.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      use symbolic constants in generic lseek code · 7b8e8924
      Chris Snook committed
      Convert magic numbers to SEEK_* values from fs.h
      Signed-off-by: Chris Snook <csnook@redhat.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
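      The two lseek patches above can be illustrated with a minimal userspace
      sketch.  DEMO_SEEK_MAX, demo_check_whence() and demo_llseek() are
      hypothetical names; the sketch assumes the upper bound is simply the
      largest known whence value and is not the kernel's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>    /* SEEK_SET, SEEK_CUR, SEEK_END */

/* The magic numbers being replaced, as assumed by this sketch. */
static_assert(SEEK_SET == 0 && SEEK_CUR == 1 && SEEK_END == 2,
              "whence values assumed to be 0, 1, 2");

/* Assumption: the bound is the largest whence value known here. */
#define DEMO_SEEK_MAX SEEK_END

/* Returns 0 for a recognised whence value, -EINVAL otherwise. */
int demo_check_whence(unsigned int whence)
{
    if (whence > (unsigned int)DEMO_SEEK_MAX)
        return -EINVAL;
    return 0;
}

/* Computes the new position using symbolic constants instead of 0/1/2. */
long long demo_llseek(long long pos, long long size, long long offset,
                      unsigned int whence)
{
    if (demo_check_whence(whence) != 0)
        return -EINVAL;

    switch (whence) {
    case SEEK_END:
        offset += size;
        break;
    case SEEK_CUR:
        offset += pos;
        break;
    default:    /* SEEK_SET: offset is already absolute */
        break;
    }
    return offset < 0 ? -EINVAL : offset;
}
```

      Rejecting out-of-range values once, at the entry point, means the
      switch below it only has to handle the named cases.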
    • A
      mm: shrink parent dentries when shrinking slab · 24c32d73
      Andrew Morton committed
      Teach the dentry slab shrinker to aggressively shrink parent dentries when
      shrinking the dentry cache.
      
      This is done to attempt to improve the situation where the dentry slab cache
      gets a lot of internal fragmentation due to pages containing directory
      dentries.  It is expected that this change will cause some of those dentries
      to be reaped earlier, and with less scanning.
      
      Needs careful testing.
      
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      fix quadratic behavior of shrink_dcache_parent() · d52b9086
      Miklos Szeredi committed
      The time shrink_dcache_parent() takes grows quadratically with the depth
      of the tree under 'parent'.  This starts to get noticeable at a depth of
      about 10,000.
      
      These kinds of depths don't occur normally, and filesystems which invoke
      shrink_dcache_parent() via d_invalidate() seem to have other
      depth-dependent timings, so it's not even easy to expose this problem.
      
      However with FUSE it's easy to create a deep tree and d_invalidate()
      will also get called.  This can make a syscall hang for a very long
      time.
      
      This is the original discovery of the problem by Russ Cox:
      
        http://article.gmane.org/gmane.comp.file-systems.fuse.devel/3826
      
      The following patch fixes the quadratic behavior, by optionally allowing
      prune_dcache() to prune ancestors of a dentry in one go, instead of doing
      it one at a time.
      
      Common code in dput() and prune_one_dentry() is extracted into a new helper
      function d_kill().
      
      shrink_dcache_parent() as well as shrink_dcache_sb() are converted to use
      the ancestry-pruner option.  Only for shrink_dcache_memory() is this
      behavior not desirable, so it keeps using the old algorithm.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Maneesh Soni <maneesh@in.ibm.com>
      Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
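      The quadratic growth described in the shrink_dcache_parent() message
      above can be seen in a deliberately simplified cost model.  It assumes a
      single chain of dentries in which only the leaf is unused at any time;
      the function names are hypothetical and nothing here is taken from the
      dcache code itself.

```c
#include <assert.h>

/* One-at-a-time model: every pass walks the remaining chain to find the
 * single unused leaf, prunes it, and starts over from the top. */
unsigned long long walk_cost_one_at_a_time(unsigned long long depth)
{
    unsigned long long steps = 0;

    for (unsigned long long remaining = depth; remaining > 0; remaining--)
        steps += remaining;    /* one full walk per pruned dentry */
    return steps;
}

/* Ancestor-pruning model: one walk down to the leaf, then the leaf and
 * each ancestor that becomes unused are pruned on the way back up. */
unsigned long long walk_cost_with_ancestors(unsigned long long depth)
{
    return depth /* walk down */ + depth /* prune upward */;
}

/* Usage example at the depth the commit message calls noticeable. */
void compare_costs(void)
{
    assert(walk_cost_one_at_a_time(10000) == 50005000ULL);
    assert(walk_cost_with_ancestors(10000) == 20000ULL);
}
```

      In this model a chain of depth 10,000 costs about fifty million steps
      when pruned one dentry per pass, against twenty thousand when ancestors
      are pruned in the same pass.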
    • W
      reduce size of task_struct on 64-bit machines · 97dc32cd
      William Cohen committed
      This past week I was playing around with the pahole tool
      (http://oops.ghostprotocols.net:81/acme/dwarves/) and looking at the size
      of various structs in the kernel.  I was surprised by the size of the
      task_struct on x86_64, approaching 4K.  I looked through the fields in
      task_struct and found that a number of them were declared as "unsigned
      long" rather than "unsigned int" despite them appearing okay as 32-bit
      sized fields.  On x86_64 "unsigned long" ends up being 8 bytes in size and
      forces 8 byte alignment.  Is there a reason they are "unsigned long"?
      
      The patch below drops the size of the struct from 3808 bytes (60 64-byte
      cachelines) to 3760 bytes (59 64-byte cachelines).  A couple of other fields
      in the task struct take a significant amount of space:
      
      struct thread_struct       thread;               688
      struct held_lock           held_locks[30];       1680
      
      CONFIG_LOCKDEP is turned on in the .config
      
      [akpm@linux-foundation.org: fix printk warnings]
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
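      The padding effect and the cacheline arithmetic in the task_struct
      message above can be reproduced with a small sketch.  The two structs
      use hypothetical field names, not real task_struct members, and the
      size difference assumes an ABI where "unsigned long" is wider than
      "unsigned int", as on x86_64.

```c
#include <assert.h>
#include <stddef.h>

/* Mixed widths: each unsigned long forces its own alignment, so padding
 * can appear after the narrower fields. */
struct fields_long {
    unsigned int  a;
    unsigned long b;
    unsigned int  c;
    unsigned long d;
};

/* Same four counters declared as 32-bit sized fields. */
struct fields_int {
    unsigned int a;
    unsigned int b;
    unsigned int c;
    unsigned int d;
};

/* Number of cachelines of the given size needed to hold size bytes. */
size_t cachelines(size_t size, size_t line)
{
    assert(line != 0);
    return (size + line - 1) / line;
}
```

      With 64-byte lines, cachelines(3808, 64) is 60 and cachelines(3760, 64)
      is 59, matching the figures quoted in the message.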
    • M
      ext2/3/4: fix file date underflow on ext2/3 filesystems on 64-bit systems · 4d7bf11d
      Markus Rechberger committed
      Taken from http://bugzilla.kernel.org/show_bug.cgi?id=5079
      
      A signed long ranges from -2,147,483,648 to 2,147,483,647 on 32-bit x86.
      
      10000011110110100100111110111101 .. -2,082,844,739 (read as signed 32-bit)
      10000011110110100100111110111101 ..  2,212,122,557 (read as unsigned 32-bit)
      
      This bit pattern is what currently gets stored on disk.  When it is
      converted to a 64-bit signed long it is read as the unsigned value, so it
      loses its sign and becomes positive.
      
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: <linux-ext4@vger.kernel.org>
      
      Andreas says:
      
      This patch is now treating timestamps with the high bit set as negative
      times (before Jan 1, 1970).  This means we lose 1/2 of the possible range
      of timestamps (lopping off 68 years before unix timestamp overflow -
      now only 30 years away :-) to handle the extremely rare case of setting
      timestamps into the distant past.
      
      If we are only interested in fixing the underflow case, we could just
      limit the values to 0 instead of storing negative values.  At worst this
      will skew the timestamp by a few hours for timezones in the far east
      (files would still show Jan 1, 1970 in "ls -l" output).
      
      That said, it seems 32-bit systems (mine at least) allow files to be set
      into the past (01/01/1907 works fine) so it seems this patch is bringing
      the x86_64 behaviour into sync with other kernels.
      
      On the plus side, we have a patch that is ready to add nanosecond timestamps
      to ext3 and as an added bonus adds 2 high bits to the on-disk timestamp so
      this extends the maximum date to 2242.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
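      The sign loss described in the ext2/3/4 message above comes down to how
      a 32-bit on-disk value is widened to 64 bits.  A minimal sketch follows,
      using hypothetical function names and the bit pattern quoted in the
      message (0x83DA4FBD in hexadecimal); it is not the filesystem code.

```c
#include <assert.h>
#include <stdint.h>

/* Behaviour described as the bug: widen the raw 32-bit value as unsigned,
 * so the high bit becomes part of a large positive number. */
int64_t decode_time_unsigned(uint32_t raw)
{
    return (int64_t)raw;
}

/* Behaviour after the fix: reinterpret as signed 32-bit first, so the high
 * bit is sign-extended.  The uint32_t to int32_t conversion is
 * implementation-defined in C, but wraps on two's-complement targets. */
int64_t decode_time_signed(uint32_t raw)
{
    return (int64_t)(int32_t)raw;
}

/* Usage example with the value from the commit message. */
void check_example_from_message(void)
{
    const uint32_t raw = 0x83DA4FBDu;

    assert(decode_time_unsigned(raw) == 2212122557LL);
    assert(decode_time_signed(raw) == -2082844739LL);
}
```

      On a 32-bit system both steps collapse into one, which is why the
      problem only shows up where time values are 64 bits wide.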
    • A
      Allow access to /proc/$PID/fd after setuid() · 8948e11f
      Alexey Dobriyan committed
      /proc/$PID/fd has r-x------ permissions, so if a process calls setuid(), it
      will no longer be able to access /proc/*/fd/.  This breaks the fstatat()
      emulation in glibc.
      
      open("foo", O_RDONLY|O_DIRECTORY)       = 4
      setuid32(65534)                         = 0
      stat64("/proc/self/fd/4/bar", 0xbfafb298) = -1 EACCES (Permission denied)
      Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-By: Kirill Korotaev <dev@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
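      The stat64() call in the trace above shows the shape of the path the
      emulation depends on: a name resolved relative to an open directory by
      going through /proc/self/fd.  A minimal sketch of building such a path
      follows; build_proc_fd_path() is a hypothetical helper, not glibc's
      implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "/proc/self/fd/<dirfd>/<name>" into buf.
 * Returns 0 on success, -1 if the result does not fit. */
int build_proc_fd_path(char *buf, size_t len, int dirfd, const char *name)
{
    int n = snprintf(buf, len, "/proc/self/fd/%d/%s", dirfd, name);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Usage example reproducing the path seen in the trace. */
void check_path_from_trace(void)
{
    char buf[64];

    assert(build_proc_fd_path(buf, sizeof(buf), 4, "bar") == 0);
    assert(strcmp(buf, "/proc/self/fd/4/bar") == 0);
}
```

      Any lookup through such a path has to traverse /proc/self/fd itself,
      which is where the EACCES in the trace comes from once the process no
      longer matches the directory's owner.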
    • A
      block_write_full_page(): report ENOSPC · 7e4c3690
      Andrew Morton committed
      block_write_full_page() forgot to propagate ENOSPC into the address_space.
      
      Cc: Guillaume Chazarain <guichaz@yahoo.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      Factor outstanding I/O error handling · 3e9f45bd
      Guillaume Chazarain committed
      Cleanup: setting an outstanding error on a mapping was open-coded in too
      many places.  Factor it out into mapping_set_error().
      Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
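      The idea behind the two error-handling patches above, recording an
      outstanding error on a mapping and keeping ENOSPC distinct from other
      I/O errors, can be sketched as follows.  The struct, the flag names and
      demo_mapping_set_error() are hypothetical stand-ins; the body of the
      real helper is not shown in the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the per-mapping error flag bits. */
enum { DEMO_AS_EIO = 1 << 0, DEMO_AS_ENOSPC = 1 << 1 };

struct demo_mapping {
    unsigned int flags;
};

/* Record an outstanding error on the mapping; 0 means nothing to record.
 * Out-of-space is tracked separately from every other error. */
void demo_mapping_set_error(struct demo_mapping *mapping, int error)
{
    assert(mapping);
    if (!error)
        return;
    if (error == -ENOSPC)
        mapping->flags |= DEMO_AS_ENOSPC;
    else
        mapping->flags |= DEMO_AS_EIO;
}
```

      With one helper, a caller that forgets the ENOSPC case can no longer
      happen by accident, which is the kind of omission the
      block_write_full_page() fix addresses.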
    • J
      uml: hostfs style fixes · f1adc05e
      Jeff Dike committed
      hostfs needed some style goodness.
      Signed-off-by: Jeff Dike <jdike@linux.intel.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      uml: make hostfs_setattr() support operations on unlinked open files · 5822b7fa
      Alberto Bertogli committed
      This patch allows hostfs_setattr() to work on unlinked open files by calling
      set_attr() (the userspace part) with the inode's fd.
      
      Without this, applications that depend on doing attribute changes to unlinked
      open files will fail.
      
      It works by using the fd versions instead of the path ones (for example
      fchmod() instead of chmod(), fchown() instead of chown()) when an fd is
      available.
      Signed-off-by: Alberto Bertogli <albertito@gmail.com>
      Signed-off-by: Jeff Dike <jdike@linux.intel.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
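      The fd-versus-path choice described in the hostfs_setattr() message
      above can be tried out in userspace.  demo_set_mode() and
      demo_unlinked_file() are hypothetical names, the sketch covers only the
      chmod()/fchmod() pair, and it assumes a writable /tmp.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Prefer the fd-based call when an fd is available (fd >= 0),
 * otherwise fall back to the path-based one. */
int demo_set_mode(const char *path, int fd, mode_t mode)
{
    assert(path != NULL || fd >= 0);
    return fd >= 0 ? fchmod(fd, mode) : chmod(path, mode);
}

/* Creates a temporary file, unlinks it while keeping it open, then tries
 * both routes.  Returns 0 if the fd route worked and the path route
 * failed, -1 otherwise. */
int demo_unlinked_file(void)
{
    char path[] = "/tmp/hostfs-demo-XXXXXX";
    int fd = mkstemp(path);
    int by_fd, by_path;

    if (fd < 0)
        return -1;
    unlink(path);
    by_fd = demo_set_mode(path, fd, 0600);
    by_path = demo_set_mode(path, -1, 0600);
    close(fd);
    return (by_fd == 0 && by_path == -1) ? 0 : -1;
}
```

      Once the name is gone the path-based call has nothing to resolve, while
      the open descriptor still refers to the inode.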
    • D
      mm: move common segment checks to separate helper function · 0ceb3314
      Dmitriy Monakhov committed
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
      Acked-by: David Chinner <dgc@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 08 May, 2007 2 commits