I’ve used this taxonomy of data to help shape my thinking for many years now.
One of the things that frustrates me the most about Linux is the dotfile madness and caches spread throughout $HOME. Modern Windows is getting much better at offering this separation, but still falls short.
A lack of principled separation of data types leads to suffering.
Without further ado, the three types of data:
- Own Output – creative effort you have produced yourself. In theory you can reproduce it, though anyone who has lost hours of work to a crash knows just how disheartening it can be to repeat yourself.
- Primary Copy – the sole copy of others’ creative output. Unreproducible. Lose this and it’s gone forever.
- Secondary Copy – a copy of anything. Cache. Can always be re-fetched (possibly at some cost).
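To make the taxonomy concrete, here’s a toy classifier over $HOME paths. The path patterns are my own illustrative guesses about a typical home directory, not a standard:

```shell
# Toy classifier: map a path to one of the three data types.
# The patterns below are illustrative assumptions, not a spec.
classify() {
  case "$1" in
    */.cache/*|*/Downloads/*)   echo "secondary-copy" ;;  # re-fetchable
    */Photos/*|*/Mail/*)        echo "primary-copy"   ;;  # irreplaceable
    */projects/*|*/Documents/*) echo "own-output"     ;;  # your own work
    *)                          echo "unknown"        ;;
  esac
}

classify "/home/u/.cache/pip/wheels"   # prints: secondary-copy
```

The point isn’t the specific patterns – it’s that once you can name the type of a path, the right level of care (version, replicate, or discard) follows mechanically.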
There’s a bit of blurring between categories in practice, particularly because you can sometimes find secondary copies in a disaster and recover primary data you thought was lost. It’s never pretty, though. Conversely, sometimes you hold the secondary copy that saves the day when someone else’s primary goes away.
Given these classifications, it’s easy to reason about some things:
Care of data
Own Output – stick it in version control. Easy. Since the effort that goes into creating it is so high compared to the cost of storing it, there’s no reason to discard historical work. The repository then becomes a “Primary Copy”, and we fall through to:
Primary Copy – back it up. Replicate it. Everything you can to ensure it’s never lost. This stuff is gold.
In FastMail’s case as an email host, it’s other people’s precious memories. We store emails on RAID1 on every backend server, and each copy is replicated to two other servers, giving three copies on RAID1, or six disks in total, each holding a full copy of every message.
One of those copies is in a datacentre a third of the distance around the world from the other two.
On top of this, we run nightly backups to a different operating system with different configuration and a different file system.
Secondary Copy – disposable. Who cares. Actually, we do keep backups of Debian package repositories for every package we use just in case we want to reinstall and the mirror is down. And we keep a local cache for fast reinstalls in each datacentre too. But if something happens to them, meh. Re-download.
It’s amazing how much stuff is just cache. For example: operating system installs. The annoying thing about reinstalling a home computer is that you add a bunch of category 1 data (your own creative output) during the install. You choose a bunch of options along the way (modern installers are getting better at asking everything up-front and then chugging away for half an hour, rather than asking something new every 5 minutes, but still).
And it’s still not done. There’s all the other programs to install. Not so much on a Linux system where you just add a list of repositories to apt/yum and then install the package list… but still work. On Windows, it’s a bunch of different installers, each with their own click-through and possibly “enter licence code”.
And then you still have to configure each app to your liking.
Finally, done. You have a system which is 99% cache, 1% own creative output. Intermingled. Reinstalling will be just as much work next time. Ouch.
Separation of data types
We fixed that at FastMail by never changing config files directly. All config goes in templates in git. No ifs, no buts. The process of reinstalling a machine is a clean install with FAI, which installs the operating system and then builds the config from git onto the system. Repeatably. Meaning the OS install is 100% cache, and hence disposable.
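A minimal sketch of the template idea, stripped down to a one-line renderer. The @HOSTNAME@ placeholder syntax and file names are mine – the real pipeline uses Template-Toolkit:

```shell
# Sketch: config is rendered from a template, never hand-edited.
# Placeholder syntax (@HOSTNAME@) and filenames are illustrative.
render() {
  sed "s/@HOSTNAME@/$2/g" "$1"   # substitute the host name into a template
}

tmpl=$(mktemp)
printf 'ServerName @HOSTNAME@\n' > "$tmpl"
render "$tmpl" mail1             # prints: ServerName mail1
```

Because the rendered output is a pure function of the git repo plus a few host facts, every config file on disk is reproducible – which is exactly what makes the whole OS disk “cache”.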
If I were doing it again today, I would probably build from Puppet. Right now we use Makefiles and Perl’s Template-Toolkit to generate the configuration files. You can ‘make diff’ to see what’s different between a running machine and the new configuration, then ‘make install’ to upgrade the config and restart the related services.
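The ‘make diff’ / ‘make install’ cycle can be mimicked with plain files. The directory names and config content here are invented for the demo:

```shell
# Simulate the review-then-apply cycle with two throwaway directories.
live=$(mktemp -d)     # stands in for /etc on the running machine
staged=$(mktemp -d)   # freshly rendered config from the git templates

echo "maxconn 100" > "$live/app.conf"     # what's running now
echo "maxconn 200" > "$staged/app.conf"   # what git says it should be

diff -ru "$live" "$staged" || true        # 'make diff': review the change
cp "$staged/app.conf" "$live/app.conf"    # 'make install': apply it
# ...then restart the affected services.
```

The diff step is the important one: because the templates are the source of truth, any difference between staged and live is either an upgrade you’re about to apply or drift someone snuck in by hand.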
Finally, user data. It goes on different disk partitions. The default “reinstall” leaves it untouched. The configuration files to access it are built from git, and go on the disposable OS disk. By keeping this separation clear, reinstall is a breeze, and considered “safe at any time”. With failover configurations for every service, it should never take more than 20 minutes to shut down a host, replace the hardware, reinstall the OS and have it back up and running. For hosts which don’t store user data locally (most of them), that’s it!
Log data also goes to a separate partition. It’s kind of “user data” too. Not replicated quite so aggressively as emails, because it’s frankly less precious.
Designing for types of data
So you’ve read this and you’re all excited to make life easier for the users of a piece of software you work on? It all comes down to one thing:
The biggest sin I see is huge configuration files with hundreds of options, where the user is expected to make manual changes to a few lines. Then the next version comes out, and there are a few more options in the default config file, so the user has to manually merge their changes forwards, or be without those changes. No, NO, NO. Bad programmer. The default configuration is “cache” – it comes from somewhere else. The explicit changes the user has requested are “Own Output” for the user, but they are “Primary Copy” to the developer. Breaking them is a crime against the poor sucker who uses your app.
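One fix is to ship the defaults as their own file and keep the user’s overrides in a separate file that upgrades never touch. A sketch, with invented file layout and option names:

```shell
# defaults.conf is cache: the package replaces it wholesale on upgrade.
# The overrides file is the user's Own Output: upgrades never touch it.
defaults=$(mktemp)
overrides=$(mktemp)

cat > "$defaults" <<'EOF'
listen_port=8080
log_level=info
EOF
echo 'log_level=debug' > "$overrides"   # the user's one explicit change

. "$defaults"                           # load shipped defaults first
[ -r "$overrides" ] && . "$overrides"   # user's changes win

echo "$listen_port $log_level"          # prints: 8080 debug
```

With this layout a new version can add a hundred new defaults and the user’s three-line overrides file merges forward untouched, automatically.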
The next worst, and this is endemic on Linux, is storing cached data in $HOME/.appname/cache/ or similar. This is really annoying because it bloats backups. The sucky thing is, there’s nowhere else reliably available. At least keep it separate, so the poor user can back up just your $HOME/.appname/ directory knowing it holds their local changes, and ignore your $HOME/.cache/$appname/ or $HOME/.appname-cache/ directory.
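The XDG base-directory spec exists for exactly this split: config in one tree, cache in another, each overridable by an environment variable. A sketch, where the app name ‘myapp’ is a placeholder:

```shell
# Config holds the user's explicit changes: back this tree up.
config_dir="${XDG_CONFIG_HOME:-$HOME/.config}/myapp"
# Cache is disposable: backup tools can skip this whole tree.
cache_dir="${XDG_CACHE_HOME:-$HOME/.cache}/myapp"

mkdir -p "$config_dir" "$cache_dir"
```

An app that does only this already lets the user say “back up ~/.config, ignore ~/.cache” once, for every well-behaved program at the same time.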
I find this taxonomy provides a lot of insight into the world of data. I would love to hear if there’s any major categorisation I’m missing, or other sensible lines along which to break data.