newuserfs -- a userspace filesystem hook for Linux 2.4

Status

newuserfs is a dead project, of historical interest at most. If you want to write userspace filesystems for Linux, you might like to look at one of these projects:

FUSE, also known as AVFS

UVFS

Happy hacking! -- mbp, 2002-08-26

Introduction

These are obviously rough notes. The code is in samba.org CVS, in module newuserfs. You can browse it through cvsweb here, and get instructions on downloading here

This will be submitted as a useful self-directed project for COMP3300 at ANU. So you know it's more likely to at least partly get working faster than some open-source projects -- my assignment is due in October 2001! ;-)

Documentation

Progress paper for COMP3300 (ps pdf)

IEEE 1058-1998 Software Project Management Plan for ENGN3221 (ps pdf)

Poster pages ps

Priorities

Allow implementation of filesystems in any language.

Allow non-root users to mount filesystems, if possible without compromising system security.

Implement purely as a module -- don't require a kernel patch or reboot to install it.

Userspace code should not depend on a particular kernel version, or machine, though it is OK if the format is different across machines. So, you can send things in native endianness, as long as that is determined at least at compile time.

Architecture

Kernel may have multiple outstanding requests. They can return out-of-order. Requests are tagged as in NFS.

One approach is to write a userspace NFS server, but the NFS protocol is both more limited than the native VFS, and it also imposes a fair bit of work, for example in mapping to persistent filehandles. Similarly for Coda, which is used by podfuk.

The basis of Unix filesystems is the inode, so they have to be core to the newuserfs interface. This lets filesystems allow IO on unlinked files, and so on. It's the userspace server's responsibility to make up consistent inode numbers. (?? how will this interact with the dentry cache?)

We don't necessarily need to exactly mirror the kernel VFS interface. Jeremy thought this might easily allow prototyping filesystems in userspace, but there are many more issues than just the VFS interface. For example, doing IO though the buffer head system is quite different to anything that can be managed in userspace.

Client process is started with a socket on stdin/stdout, and it should exit only when it gets EOF on that socket. Bad things might happen otherwise.

The mount process passes two file descriptors to the kernel, which are normally connected to pipes or sockets running to the userspace process. They're converted into struct file objects, which are grabbed by the kernel.

I think we might start by making all kernel processes block until the filesystem returns. We can allow overlapped calls in the future. We know this was not required for the original userfs because the userspace program is not required to support more than one concurrent request.

We can probably piggyback on the kernel's existing sunrpc code to do marshalling/unmarshalling. Cool.

Races/locking/concurrency problems. Hard question.

Perhaps say something philosophical about the way there is a mirror from VFS to a specific FS in the kernel. This system mirrors that out into userspace. As for the VFS, calls are translated across.

Linux combines elements of the four OS structures:

monolithic
modular
microkernel
layered

The userfs hack adds a little more of the microkernel flavour to the melange.

Evolution from the original userfs

Don't use C++

Using network byte order doesn't seem very useful: if you want to cross machines boundaries that is better done by having a userspace proxy server. Perhaps not so expensive.

Perhaps use a text protocol?

Protocol

What do we need?

In calls:

Sequence number, to eventually allow for overlapped calls.

Operation code

File and inode tickets

Parameters, which might be strings or integers, or something else.

Bulk data, e.g. for write calls.

In responses:

Matching sequence number.

Result: positive for success, zero, or negative for error. (Really?)

Bulk data, e.g. for read calls.

It may be simpler to keep most fields (e.g. file and inode tickets) fixed even if they're not relevant to particular calls.

Everything should be 64-bit. ???

Should this be binary (XDR?) or text? Text is possibly easier, except when we need to embed binary data. Even filenames might be hard to quote.

We need moderately complex encoding to handle readdir results. Can we/should we just ship binary structures?

Open files are identified by handles -- 32-bit integers, say.

Files are opened by presumably giving a handle for a parent directory, and the name of the file inside the directory. Perhaps if you've read a directory and seen an inode number you'll want to open it directly, but I don't think the kernel ever does that. (Another strength compared to NFS.)

Interface methods

pipes or sockets

same as original userfs
you can select/poll on the fd to be notified when something is available.
we don't have to explicitly implement wait queues
can be implemented in any language that can do binary IO, including Java
the pipe/socket is transient -- it doesn't require a /proc or /dev entry
I think this is what rtnetlink does, so there is some precedent.

ioctls

tridge reckons this is more tasteful: the kernel's not meant to use binary protocols on files to communicate with userspace processes (although there are several binary formats that it knows about)
each transfer is clearly atomic: just one turnaround, and cannot be split across multiple readers.

How do we know what filesystem the ioctl call is talking about? What file are the ioctls executed against? I guess we could use the same fd-stealing system to set it up.

Unrelated processes that can open the ioctl file can do their own operations, which may or may not be an advantage. I guess the same would apply for named pipes.

XXX: A problem with using pipes is that the sending process has to block for an arbitrarily long time waiting for the pipe, and (worse) it probably has to be uninterruptible while blocked.

On the other hand if we had an explicit request queue from which messages were fetched by ioctls, then each task could just push in its request and be in an interruptible sleep on the response. That probably gets us closer to the goal of the worst possible failure being an interruptible hang.

I wonder what happens if a kernel task sleeps on this? Probably bad things. We could perhaps have a watchdog on calls to the userspace server, so that after 15s they're given up for dead and we get EIO. However, some servers might take a long time. At any rate at least if the userspace server crashes then we can detect that.

It's easier to detect a crashed server with a pipe, because the pipe just closes. I guess we could do IOCTLs on a pipe, which would also solve the problem of choosing an fd.

To implement an ioctl without hackiness we need to also create the inode_operations for the file on which the inode is called. Almost always this means creating a new char device. That's pretty undesirable, though. I guess we could also create a /proc entry.

Another problem with that is knowing how many such entries to create, and associating them with the appropriate server process.

It's practically impossible that unrelated server processes would all want to collaborate on a filesystem.

We also can't use a single interface for all filesystems because farming out messages between processes would be too messy.

What operations?

noop. Used to check things are working.

mount, umount. Mount has to return the inum for the root inode.

open, close. All process credentials are passed on open, and access checks are made then. (This is one place where we're possibly different to NFS.) open should perhaps return a file ticket that should be passed on future operations, because it represents credentials.

creat. Perhaps open should only deal with existing files?

iget, iwrite, iput. [Remember to explain kernel jargon where "get" implies incrementing the use count of something, and "put" decrements the use count and puts it back on the shelf.]

read, write, lseek, ...

readdir

statvfs

stat, lstat

chmod/fchmod, chown/chgrp/fchown/fchgrp, utimes, ... In the kernel these are all wrapped up in setattr(), so perhaps we should do that too.

fsync?

rename, mkdir, rmdir, link, unlink

mknod. mknod is bogus.

readlink, symlink, followlink. The kernel treats followlink separately from readlink, so symlinks can point somewhere but look like they point somewhere else. I guess they should default to using readlink if this does not exist.

check permission. Permission to do what? inode_operations has a function which checks read/write permission, presumably mapping to Why do this separately rather than just allowing operations to fail with EPERM?

locking. We need byte range locks, and possibly other stuff.

Should we just pass all operations through, or probe to decide which ones should have default behaviour? Should there be an explicit result for "don't care?"

If we're started by a non-root user, then are there any extra security conditions we need to impose? Can we restrict them to only having files they own themselves? Certainly we should not allow setuid files, etc.

Most things from inode_operations and file_operations should be passed down.

Do we want to distinguish operations on a file vs operations on a directory? I guess probably so.

Also, do we need to mirror the concept of dentries? Are they useful in userspace separately from files?

We can pass down locking later...

ACLs later too...

It's unclear that readv/writev need to map down. They should default to splitting into many reads/writes.

Read Documentation/filesystems/vfs.txt.

If running as a user, you should own the file on which you are mounting.

The worst possible consequence of a bug or crash in the server process should be an interruptible hang in the process doing disk IO. This means it should not block SIGKILL, and also (I think) that if it is killed it should release all resources on the way out.

Consider a server that needs to talk to an underlying layer using paths. If we make it do the mapping between inode numbers and paths, which would be the most convenient situation for the kernel, then there is a fair bit of work. We could do that in the library. This cache will grow unboundedly, unless the server is informed when inodes are released. So, we possibly need to pass the put_inode call back to userspace so that the userspace task can let them go. olduserfs has this.

TODO: Have a way for the server to decline to answer particular requests? Up front, or for each call?

We might also think about having callbacks to let the userspace app query the dcache.

If we let the server override dentry operations, then they could for example implement d_compare to make directories that are not case-sensitive. That might also be necessary to make sure the dcache is not too keen in retaining our entries.

http://jungla.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html for hacker docs.

Can we detect recursion deadlocks? If all the processes that are listening on the pipe have blocked waiting for us, then something's definitely wrong. This won't detect cycles of more than one process. I wonder if the kernel has a deadlock cycle detection algorithm?

Can we avoid worrying about the page cache, and just use read/write. I think that will work for many things, including perhaps exec, but possibly not mmap.

Filehandles

Imagine a filesystem that wants to return a filehandle for e.g. a TCP socket. We might want to implement a /net/tcp filesystem, like on Plan9 for example. Can we override the open operation on the filesystem? If so, how do we communicate the new file back to the kernel?

Applications

Write-behind replicating filesystem.

Writable CD-ROMS that modify a delta database.

Rewrite smbfs to use this and build on smbclient.

rsyncfs.

zipfs/tarfs.

A versioning filesystem like ClearCase. I think somebody tried to do this for Linux, but over NFS.

Strange Ideas

Possibly allow files to be mapped into the memory space of the server process, so that e.g. files memory mapped by the client process simply become shared segments with the server. This is fine, but I'm not sure if it would be very useful.

Perhaps the protocol should be clean enough to allow clients to be ported to an implementation of this on top of BSD or Solaris...

daniel talks about a way to eventually allow the kernel to make callbacks directly into userspace. This might be faster, but it's not necessary.

Thanks

Daniel. Tridge. tpot.

Related Projects

userfs

The original Linux userfs was written by Jeremy Fitzhardinge.

At startup, userfs queries the server process to see what kinds of requests it wants to handle. Non-handled requests get the default behavior in the kernel, which avoids context switches.

GNU Hurd

A major feature of the GNU Hurd is that it allows per-process filesystems provided by userspace processes. See in particular Towards a New Strategy of OS Design, which is a general overview of the motivation and architecture. The Hurd-talk slides give a bit more detail with some code examples.

The Hurd goes much further of course, with even authentication of processes handled by out-of-kernel servers. But pluggable filesystems seem to be the most studly feature of the Hurd. It would be interesting to measure performance on Linux compared to the Hurd. The Hurd uses Mach ports, which are very vaguely similar to pipes.

The Hurd allows you to interpose translators *on top of* files. I guess you could do that here, although the kernel doesn't know anything about it directly.

Windows NT

Windows NT has "filesystem redirectors", which I think perform a similar function.

GNOME VFS

Here is a short introduction.

Does all this in userspace between cooperating processes.

LD_PRELOAD hacks

e.g. for Samba. Inflicts a virtual filesystem on unsuspecting processes. Can't cover everything, in particular mmap and exec.

FiST

site

A research tool for writing stackable filesytems that can plug into multiple kernels! (BSD, Solaris, Linux.) Refers to lofs, nullfs and wrapfs, which might be interesting.

They have some interesting example filesystems to write:

snoopfs: log access-control failures

cryptfs: encrypt files. also encrypt filenames, then uuencode them to make sure they're valid in the underlying filesystem.

aclfs: implement ACLs stored in .acl in each directory.

unionfs.

Plan 9 from Bell Labs

$Id: index.latte,v 1.15 2001/10/16 04:20:54 mbp Exp $