The response time paradox

Where I work, we have a number (let’s say around 100) of backend services that are used by front-end code to render our website. These services are usually centered around one part of our business – we have, for example, a service for managing users and getting information about them, another for getting order information, etc. We have a lot of monitoring on these services, mostly through NewRelic, which makes us feel like we have pretty good visibility into them most of the time.

It isn’t uncommon for one of these services to see spikes in call volume lasting anywhere from a few minutes to a few hours – usually because of some batch process, a spike in web traffic, or something similar. Sometimes an interesting phenomenon occurs – despite a large spike in traffic, response times appear to go DOWN. What is even more perplexing, every individual endpoint in the service responds slower under the load, yet the overall response time appears to decrease.

It’s interesting to think about why this is the case. If you get a bunch of smart developers and operations people into a room, a bunch of theories will start to come out. Maybe the extra requests are keeping a specific thread hot and so there is less context switching? Maybe something is being cached and re-used repeatedly? Maybe an increase in L2 cache hits? And so on…

Some of these may have some element of truth to them, but it’s hard to imagine any of them improving response times by any measurable amount. The answer, it turns out (at least in every case I’ve looked into so far), is a little surprising to some people – the services aren’t actually responding faster, they are responding slower, and in some cases quite a bit slower. The way we report response times misleads us and makes us think otherwise, however.

To illustrate, let’s imagine a service that manages users, with two endpoints: /read for getting information about a user, and /write for creating users. Our read endpoint is really fast and called quite frequently, whereas our write endpoint does more work, is much slower, and is called relatively infrequently. Assume that under normal circumstances, our read endpoint is called 100 times per minute and takes 10 ms to respond, while our write endpoint is called 5 times per minute and takes a whopping 500 ms to respond.

What is our mean response time? Well, ((100*10) + (5*500)) / 105 ≈ 33 ms.

Now, what happens if our read traffic goes up by 10x for a minute, while the write traffic stays the same, and due to increased load, everything is 20% slower?

((1000*12) + (5*600)) / 1005 ≈ 15 ms.

It looks like our response time has dropped by more than half, even though every request has actually gotten 20% slower. This is a fairly typical result.
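To see the arithmetic in one place, here’s a quick sketch (plain Node, using the made-up traffic numbers from above) that reproduces both means:

```js
// Volume-weighted mean response time across endpoints.
function meanResponseTime(endpoints) {
  const totalTime = endpoints.reduce((sum, e) => sum + e.calls * e.ms, 0);
  const totalCalls = endpoints.reduce((sum, e) => sum + e.calls, 0);
  return totalTime / totalCalls;
}

// Normal minute: the slow writes are a meaningful share of traffic.
console.log(meanResponseTime([
  { calls: 100, ms: 10 },   // /read
  { calls: 5,   ms: 500 },  // /write
])); // ~33.3 ms

// Spike minute: 10x reads, and every request is 20% slower.
console.log(meanResponseTime([
  { calls: 1000, ms: 12 },  // /read
  { calls: 5,    ms: 600 }, // /write
])); // ~14.9 ms -- "faster", despite every request being slower
```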

There are two fairly easy criticisms you can make:

1) You shouldn’t be aggregating across endpoints

This is sort of fair, but at some point you’ll have so many endpoints that monitoring each one individually makes it really hard to get useful information. It is also useful to aggregate at the service or process level, because that is one of the data points you use to gauge how services are doing and to decide when to scale them. Finally, the same situation can happen within a single endpoint – what if most of our users loaded really quickly, but a small subset were power users who had a lot more data attached to them and thus loaded much more slowly? This is a scenario I’ve run into before.

2) The whole problem is that you are using the mean

This is also a good point – anyone with even a fairly basic background in statistics can tell you why using the mean may not be the best way to measure aggregate response times.

Using the median might make situations like this less likely (though you can still contrive examples of them), but it hides data in its own way – it can easily mask the fact that you have a number of really slow endpoints if you have slightly more fast ones.

Using percentiles or a histogram of response times avoids most of these problems, but requires a lot more data to be kept around than calculating the mean on the fly (I suspect this is why NewRelic shows the mean by default and only retains limited history for other metrics). Indeed, the 99th percentile is the only metric that has actually shown response times increasing when we’ve had these events in the past, although some of the spikes have been so large that even the 99th percentile isn’t sensitive enough.
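That last caveat is easy to demonstrate with the toy traffic from earlier. Here’s a sketch using a simple nearest-rank percentile (one common way to compute percentiles, not necessarily what NewRelic does internally):

```js
// Build a minute's worth of response-time samples (ms).
function samples(readCount, readMs, writeCount, writeMs) {
  return [
    ...Array(readCount).fill(readMs),
    ...Array(writeCount).fill(writeMs),
  ];
}

// Nearest-rank percentile: the smallest value >= p% of samples.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.ceil((p / 100) * sorted.length) - 1];
}

const normal = samples(100, 10, 5, 500);
const spike  = samples(1000, 12, 5, 600); // 10x reads, everything 20% slower

console.log(percentile(normal, 99)); // 500 -- the slow writes are visible
console.log(percentile(spike, 99));  // 12  -- writes are now <1% of traffic
```

With a 10x spike, the slow writes fall below 1% of total requests, so even the p99 gets diluted by the mix shift – exactly the “isn’t sensitive enough” case above.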

tl;dr Sometimes the way metrics are reported can be really misleading in extreme cases, and if something feels counterintuitive, it may not be what it seems. I generally prefer to look at the 95th/99th percentile graph when I’m looking at a service – it tells you whether you really have things you need to look into, and it often gives you an idea of what the future looks like for your service.

2017 Tax Legislation Calculator

With both the Senate and the House proposing substantial tax changes in the last few months of 2017, there has been a lot of discussion around the various plans and taxes in general. It would be almost impossible to craft tax legislation without some groups of people coming out better off than others, and the current crop of legislation is no exception.

Unfortunately, these tax changes – and taxes in general – are very difficult to fully understand, and lots of people have been discussing these changes using information that is incomplete and/or incorrect. For example, it is easy to look at the reduction in tax rates but miss the fact that the personal exemption is gone. I’ve also seen lots of people using tax brackets incorrectly in popular Twitter posts.
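The most common bracket mistake is applying your top bracket’s rate to all of your income. Brackets are marginal: each rate only applies to the slice of income that falls inside that bracket. A quick sketch of the mechanics (the bracket numbers here are purely illustrative, not actual current or proposed law):

```js
// Hypothetical brackets: [upper bound of the bracket, marginal rate].
// Illustrative numbers only -- not real tax law.
const brackets = [
  [9525,     0.10],
  [38700,    0.12],
  [82500,    0.22],
  [Infinity, 0.24],
];

// Each rate applies only to the income inside its own bracket.
function tax(income) {
  let owed = 0;
  let lower = 0;
  for (const [upper, rate] of brackets) {
    if (income <= lower) break;
    owed += (Math.min(income, upper) - lower) * rate;
    lower = upper;
  }
  return owed;
}

console.log(tax(50000));
// 9525*0.10 + (38700-9525)*0.12 + (50000-38700)*0.22 = 6939.50
// ...not 50000 * 0.22 = 11000, as the "flat bracket" mistake would suggest
```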

I have my own personal views on these proposed pieces of legislation, but what I think about them isn’t very interesting. My goal, rather, is to help better inform the debate around them, because I think it is important to have these discussions with information that is as full and accurate as possible.

I’ve been working on a bunch of content that I’ll be releasing over the coming week, including blog posts explaining various parts of the tax system, some interactive calculators, and more. Tonight, I’m releasing an early version of my first interactive calculator that you can use to see how the proposed changes to the tax code would affect you. You can view it here.

This calculator is far from complete (I’m just one guy, and there is limited time to build something like this while the legislation is being considered), but it does a decent job of including the major changes that would impact most people. It is pretty simple and bare-bones, but I am working on a number of features that will hopefully show you not just how much your taxes would change, but why.

Please check it out and let me know what you think. I’d love to hear about stuff I’m missing and, especially, how I could make it more useful for you.

CloudStudio for AWS now available

I’m excited to announce the initial release of a project I’ve been working on – CloudStudio.

Amazon’s AWS provides a ton of great services, but not all of them have easy UIs for testing and interacting with them. As a frequent user of services like Kinesis and Firehose, I’ve often wished for an easy way to push a message onto a stream, or take a peek at messages coming in, so I started building a suite of tools to do just that.
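For a sense of the chore being replaced, here’s roughly what pushing a single test message onto a Kinesis stream looks like by hand with the AWS SDK for Node (the stream name and region are placeholders):

```js
const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis({ region: 'us-east-1' });

// Push one test record onto a stream -- the kind of one-off task
// that's tedious from code but trivial from a GUI.
kinesis.putRecord({
  StreamName: 'my-test-stream',              // hypothetical stream
  PartitionKey: 'test-key',
  Data: JSON.stringify({ hello: 'world' }),
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Sequence number:', data.SequenceNumber);
});
```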

This initial release focuses on Kinesis, Firehose, and SQS. You can send messages to all three, and you can poll messages from Kinesis, all with a simple and easy-to-use GUI. It is available now for Mac for $9.99, and I’m wrapping up Windows and Linux versions for release in the next week or so.

It may be starting out simple, but I have some really big plans to make it even more useful in the near future, by doing things like allowing you to stream to Kinesis from a file, tail an S3 file, and more. I’ve made my roadmap open on Trello, and I’ve added a forum to solicit feedback on what features you want.

I want this to be a really powerful tool for AWS users, so I hope you’ll check it out here and give me some feedback on how I can make it more useful for you.

Running a single Mocha test file in VS Code

I’ve been doing a lot of Node development lately, after doing mostly Java for the past 10 years. There are lots of comparisons between the two, and I’ve come away with a better understanding of where each one shines and what I wish I could take from each.

One thing I miss about Java development is being able to right-click a single test, run it, and easily debug it. You can get close to this in Node-land, but it is nowhere near as simple or seamless. You can, however, get kind of close with VS Code (which I’m loving more and more every day) by creating a custom launch configuration that lets you debug a single Mocha file.

1. Create a launch configuration that only runs the current file

This launch configuration passes the file currently open in the editor to the mocha command.
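A minimal sketch for .vscode/launch.json, assuming mocha is installed locally in node_modules (${file} is VS Code’s variable for the current file; _mocha is the in-process entry point, which is what lets the debugger attach):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "Mocha: current file",
      // _mocha runs tests in-process, so breakpoints work
      "program": "${workspaceFolder}/node_modules/mocha/bin/_mocha",
      "args": ["--no-timeouts", "${file}"],
      "internalConsoleOptions": "openOnSessionStart"
    }
  ]
}
```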

2. If you have a mocha.opts, you may need to override it

A lot of projects have a mocha.opts file that has something like this: ‘--recursive test/’.

Command-line args should override options in mocha.opts, but it looks like the file specification part does not get overridden. So, what I did was create a dummy mocha-debug.opts that is empty, then point to it in the config:
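A sketch of the updated args, assuming the empty file lives at test/mocha-debug.opts (--opts is mocha’s flag for pointing at an alternate opts file):

```json
"args": ["--opts", "test/mocha-debug.opts", "--no-timeouts", "${file}"]
```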

3. You can now run and debug a single Mocha file.

Betterment substantially increases fees

I recently became familiar with Betterment when my employer switched to them as our 401k provider. I started looking into the services they provided and became really intrigued by their automated investing and tax-loss harvesting. I’ve usually stuck with Vanguard and its low fees, but with Betterment’s wrap fee at .15% for balances over $100k, it was tempting to try them, since in theory, at least, the tax-loss harvesting would more than pay for the additional fee.

Getting all of my investments to them took a fair amount of time and money (though they made the process as easy as possible), and I was excited when I got my emails this morning saying my last 2 big accounts had been received by them.

Less than half an hour later, I got another email titled, innocuously, “New Betterment service plans for 2017”. The email discussed new options that would let you use the services of a CFP – odd, given their pitch about automated investing, but not a big deal. Then, tucked four paragraphs down, is the real reason for the changes:

Each plan will cost a simple, flat rate. Starting June 1, your Digital plan will be 0.25% per year of your average balance.

For accounts with over $100,000 in them, this represents an increase of 67% (from .15% to .25%)! Betterment tried to make it as low-key as possible that they were raising fees on us by a huge amount without offering anything in return. I am really disappointed, both in the increase itself and in the way they announced it – I had a high opinion of the company before this.

At this point, it looks like WealthFront is a better option. Both do a lot of the same things and offer similar features at a fixed .25% fee, but WealthFront manages the first $10,000 for free and offers a Direct Indexing service that lets you avoid ETF fees, making the combined cost substantially less than Betterment’s.
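To put the difference in dollar terms, here’s a quick sketch on a hypothetical $200k balance (assuming, as the tiers above suggest, each rate applies to the whole balance, and ignoring underlying fund expenses and any Direct Indexing savings):

```js
// Annual advisory fee on a given balance under each pricing scheme
// described above. The $200k balance is just an example.
const balance = 200000;

const betHold = balance * 0.0015;           // old Betterment, >$100k: $300
const betNew  = balance * 0.0025;           // new flat 0.25%:         $500
const wf      = (balance - 10000) * 0.0025; // first $10k free:        $475

console.log({ betHold, betNew, wf });
```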

Fees should be getting cheaper as companies like this gather more assets under management, not rising by almost 70%, and companies should be more forthright about how they raise them.