SSL – not easy to get right

I’ve been ignoring blogging here for a long time – it’s time to write something!

Customer Support

One of my roles is providing technical support for queries related to my areas of the system which can’t be resolved by our frontline support staff. This usually means either a bug at our end, or some odd issue which hasn’t been seen before.

Malfunctioning software

Here’s something interesting that came up with a customer query last week. They were using XFCE’s built in “biff” equivalent, but it was failing to list folders. In the default configuration it only counts UNREAD messages in INBOX, but you can ask for a server folder listing and add additional folders to watch.

The user’s problem was that upon opening the folder list, the program would sit saying “loading” forever. It worked fine with their gmail account using the same “imap” backend, so they wanted to know why it didn’t work with our server.

Red herrings

My initial debugging step, in a situation like this, is to enable protocol debugging on our Cyrus IMAP server. This is done by creating a folder with the user’s username in $confdir/log/. Upon seeing this directory, Cyrus daemons will open a file per process and log all protocol traffic after the login to the file.

The debugging showed that the mailwatcher plugin uses multiple list commands, like this (listing from my own account):

00006 LIST "" "%"
* LIST (\HasChildren) "." INBOX
* LIST (\Noselect \HasChildren) "." RESTORED
* LIST (\Noselect \HasChildren) "." user
00006 OK Completed (0.020 secs 225 calls)
00007 LIST "INBOX." "%"
* LIST (\HasNoChildren) "." INBOX.AANotify
* LIST (\HasNoChildren) "." INBOX.AATemp
* LIST (\HasNoChildren) "." INBOX.AAUnprocessed
* LIST (\HasNoChildren) "." INBOX.Account
* LIST (\HasChildren) "." INBOX.Archive
* LIST (\HasNoChildren) "." INBOX.Chats
* LIST (\HasNoChildren) "." INBOX.Drafts
[...]

The debug logs on the server showed it completing the initial top level listing of the INBOX, but then the client wasn’t listing the subfolders as expected.

Another one of our team also uses the same program, and he said it was working for him – but he didn’t have subfolders.

Another potential candidate was a folder named INBOX._dup_ in the user’s namespace. The underscores were a potential piece of weirdness.

I had a look at the source code for xfce4-mailwatch-plugin online and couldn’t see any obvious causes, so I recommended the user file a bug with XFCE and left it there.

A deeper look

But I do run XFCE – and something annoying me with their terminal app a few days later convinced me I should sign up with the XFCE bug tracker and clone a couple of their git repositories. While I was doing that, it made sense to check out the mailwatch program again too.

I tried it – and amazingly enough, my own mailbox had the same issue.

So I built from git and started debugging. There are two things which are very useful here. One is the compile-time option --enable-debug=yes which makes DBG() statements in the code print to stderr. The other is switching the XFCE panel itself to debug mode. This is documented here: http://wiki.xfce.org/howto/panel_plugin_debug

Debugging showed that it was parsing just some of the INBOX folder listing before freezing for long enough to obviously be a timeout, and saying “(nada)” returned. The code shows this means that a zero length response was returned by imap_recv.

GnuTLS and SSL

We don’t allow non-encrypted IMAP to our servers because unencrypted email => identity theft in this day and age, so every test was over SSL. On a whim, I created an ssh tunnel directly to the backend server with my mail, and tried setting up an account without SSL to the tunnel port. This time it worked fine!

Woah. There was already a voice in the back of my head saying “buffers” – particularly since it always failed at the same folder and there was nothing special about the name.

So I looked some more at GnuTLS documentation. And hit a falacy that I already knew. It’s one of the worst things about working with SSL. We use openssl in Cyrus, but the same conditions hold with GnuTLS.

  1. If you write encrypted data into the decryption function, it doesn’t mean anything will come out (possibly multiple times)
  2. If there’s nothing readable on the underlying socket, it doesn’t mean there’s nothing still in the decrypter buffer.
  3. The output may be larger than the input.

In this case, xfce4-maliwatch-plugin fails to account for item 2. It checks for data to read from the underlying filehandle, and if there is none, it never even checks if there’s any more to read from the decrypter function.

Buffer sizes

In this case, my folder listing was 1280 bytes, which fits into a single ethernet packet. But the output buffer was only 1024 bytes – so the gnutls decoder had extra data left after the first read. This is a common occurrence in C code where arbitrary fixed size buffers are the norm, but it’s going to happen anywhere at some point. You can’t have infinitely expanding buffers – and see point three above, sometimes the output is larger than the input.

Working around

The ideal solution is to use non-blocking gnutls and only select if you have already had a zero byte read from the decrypter. There’s more in this thread http://lists.gnu.org/archive/html/gnutls-devel/2011-05/msg00028.html.

Their basic point is – if you’ve already had a zero byte read from the pipeline, you’re not going to get any more out unless you put something more in first! Fair enough.

So the real solution would be to track zero byte reads and only THEN go back to selecting on the file handle.

But there’s a really skanky workaround possible – just bump the output buffer to 4096 bytes. Big enough that any likely single packet will fit comfortably. It’s not a real solution, but I proposed it on the bug anyway:

https://bugzilla.xfce.org/show_bug.cgi?id=9284

For all that it’s not a “real” solution, it will work around the bug in one very simple change.

Next steps

Since I’m a sucker for fixing things, I’m going to have another look at the code and see how hard the real fix is.

Still, it’s amazing how often a simple workaround is “good enough”. If this was affecting our production systems, I would have rolled out the buffer size bump immediately to avoid the symptoms while working on a proper cure.

Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s