A good day; things were interesting across the board, and I even asked questions and bugged speakers. I ended up having lunch with Mike, Selena Deckelmann, and others; Selena is a good presenter, and fun company for lunch. Every time slot had something interesting, and the “Ten Worst Inventions” presentation was priceless beyond my ability to blog it. I was too busy laughing. If you didn’t see it, make a point of finding it when the presentations are released in non-streaming form.
Talking to a few other people there (delegates rather than presenters) about CPU affinity, power and thermal management, it’s interesting how many of them look as though a light has gone on in their heads when I explain why I care about these things. It’s not because I want to save power. It’s because the optimisations you make to be smarter about clustering tasks to reduce your thermal profile (for example, grouping processes onto the minimum number of cores so that other processors and cores can be selectively shut down) also have huge benefits in a heavily virtualised environment, like my zLinux guests. After all, if you pack your processes more densely for higher utilisation, your hypervisor can schedule you on fewer processors, allowing better utilisation of the system as a whole.
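To make the packing idea concrete, here is a minimal Python sketch, assuming Linux (where `os.sched_getaffinity` and `os.sched_setaffinity` exist). The helper `pack_onto_cores` is hypothetical: it works out the smallest set of cores that can hold a given number of busy tasks, then the script restricts its own process to that set. It naively takes the lowest-numbered cores; a real scheduler would also care about topology (SMT siblings, sockets, which cores can actually be powered down).

```python
import os


def pack_onto_cores(n_tasks: int, tasks_per_core: int, available: set[int]) -> set[int]:
    """Pick the fewest cores (lowest-numbered first) that can hold n_tasks."""
    if n_tasks < 1 or tasks_per_core < 1:
        raise ValueError("need at least one task and one task per core")
    needed = -(-n_tasks // tasks_per_core)  # ceiling division
    if needed > len(available):
        raise ValueError("not enough cores available for this packing")
    return set(sorted(available)[:needed])


if __name__ == "__main__" and hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)  # cores this process may run on now (0 = self)
    target = pack_onto_cores(n_tasks=2, tasks_per_core=2, available=allowed)
    os.sched_setaffinity(0, target)  # narrow the mask; later fork()ed children inherit it
    print("restricted to cores", sorted(os.sched_getaffinity(0)))
```

Processes forked after the call inherit the narrowed mask, which gives the remaining cores (or, under a hypervisor, the remaining physical processors) a chance to sit idle.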
The ceph presentation got me quite excited, not least because of the aforementioned great lunch with Selena after her talk, so my PostgreSQL neurons were already firing. As well as the inherent coolness of ceph as an open-source, scalable, distributed filesystem, two features caught my eye:
- ceph allows you to bypass putting a cooked filesystem on top of its object store; you can access objects directly. pg stores its tables, indices, and other database constructs in big files sitting in directories. In theory, it seems like it should be quite a nice mapping for pg to store that same information in ceph objects instead, getting the benefits of ceph: reliability, a massively scalable filesystem, and massive IOP rates, which DBs generally care about. You could start small with special cases, such as large objects.
- In and of itself that is maybe not so interesting, but ceph also supports plugins on its storage servers that can access the data inside the objects. Sage’s example was a JPEG plugin that would allow thumbnailing, image rotation and the like to be farmed out to the storage servers by a front-end Flickr-type app. From my POV a database would be even more interesting on that front: an index object, which has a definable number of copies in the ceph storage cluster, could get true parallel query access if you had a PostgreSQL plugin for ceph. Your storage nodes could potentially accelerate your query performance massively.
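A toy Python model of both ideas might look like this. It is a sketch under stated assumptions, not how ceph or PostgreSQL actually work: it assumes PostgreSQL’s default 8 kB page size, an assumed 4 MiB object size, and a made-up `rel.<relfilenode>.<index>` object naming scheme; `locate_page`, `storage_side_filter` and `parallel_scan` are all hypothetical. A thread pool stands in for the storage nodes, each running a "plugin" that filters its own object’s rows so only matches travel back to the front end. (Ceph’s real mechanism for server-side code is its object-class plugins, which this doesn’t try to model.)

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192                             # PostgreSQL's default page size (BLCKSZ)
OBJECT_SIZE = 4 * 1024 * 1024                 # assumed object size, not a ceph requirement
PAGES_PER_OBJECT = OBJECT_SIZE // BLOCK_SIZE  # 512 pages per object


def locate_page(relfilenode: int, block_no: int) -> tuple[str, int]:
    """Map a relation page to a hypothetical object name and byte offset within it."""
    obj_index, page_in_obj = divmod(block_no, PAGES_PER_OBJECT)
    return f"rel.{relfilenode}.{obj_index:08x}", page_in_obj * BLOCK_SIZE


def storage_side_filter(rows: list, predicate) -> list:
    """Stand-in for a plugin running on the node that holds one object."""
    return [row for row in rows if predicate(row)]


def parallel_scan(store: dict[str, list], predicate) -> list:
    """Fan the predicate out to every object at once and merge the matches in name order."""
    names = sorted(store)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda name: storage_side_filter(store[name], predicate), names))
    return [row for part in parts for row in part]


if __name__ == "__main__":
    print(locate_page(16384, 513))  # ('rel.16384.00000001', 8192)
    table = {"rel.16384.00000000": [1, 5, 9], "rel.16384.00000001": [2, 6]}
    print(parallel_scan(table, lambda row: row > 4))  # [5, 9, 6]
```

The point of the design is that the filtering work scales with the number of objects (and, in the real thing, with the number of replicas you could read from), rather than funnelling every page through one database backend.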
As theorycrafting it’s interesting, but of course there’s the small problem that my C skills are way too lousy to hack on the second idea, or probably even the first, myself. I’m nonetheless keen to play with ceph. Finding a stack of cheap old PCs to stick in a corner for that, now that’s the thing.
A final thought on the talks: something that would help filesystem authors get testers, I think, would be the ability to mirror data across filesystems. Linux only really does block-level mirroring, which means that if you want to try ext4, NILFS or btrfs, for example, you have to be prepared to risk your data between backup cycles. If something goes wrong, well, whoops. This is a pretty big impediment, and it feeds into Ted’s talk on why it takes so much time and effort to get datacentre-quality filesystems ready. If I had a FUSE driver that could sit above two different filesystems on two different block devices, this would be a lot easier: I could, for example, keep my ‘safe’ ext3 /home-backup alongside a btrfs /home, and be sure that every write hit my stable old ext3 partition as well as exercising my shiny new btrfs one. As far as I can tell, nothing built on FUSE offers this today.
(And yes, it would be slow and painful compared to running on a single filesystem, but it could still open up a bigger pool of early adopters.)
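The heart of such a mirroring layer is just its write and read paths. The sketch below assumes two ordinary mounted directories and uses hypothetical helpers `mirrored_write` and `verified_read`; it doesn’t implement FUSE itself (a real driver would call something like this from its write and read callbacks through a FUSE binding), and it only handles whole-file writes, leaving out partial writes at offsets for brevity. Each write goes to the stable filesystem first and is `fsync`ed on both before returning, and reads from the new filesystem are cross-checked against the stable copy, so a mismatch flags a possible bug in the new filesystem rather than silently losing data.

```python
import os
from pathlib import Path


def mirrored_write(relpath: str, data: bytes, new_root: Path, stable_root: Path) -> None:
    """Write a whole file under both roots; return only once both copies are fsynced."""
    # Stable filesystem first, so the data is safe even if the new one misbehaves.
    for root in (stable_root, new_root):
        target = root / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push it to the device, not just the page cache


def verified_read(relpath: str, new_root: Path, stable_root: Path) -> bytes:
    """Read via the new filesystem, cross-checking against the stable copy."""
    fresh = (new_root / relpath).read_bytes()
    stable = (stable_root / relpath).read_bytes()
    if fresh != stable:
        raise OSError(f"mirror mismatch on {relpath}; the stable copy is authoritative")
    return fresh


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as new, tempfile.TemporaryDirectory() as old:
        mirrored_write("notes/todo.txt", b"try btrfs", Path(new), Path(old))
        print(verified_read("notes/todo.txt", Path(new), Path(old)))  # b'try btrfs'
```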