a Scaleable A/B testing backend in ~100 lines of code (and for free*)

(updated: 2016-05-07)

tip-toeing on the shoulders of giants

Before I dive into the reasons for writing Gimel in the first place, I’d like to cover what it’s based on. Clearly, 100 lines of code won’t get you that far on their own. There are two (or three) essential components this backend runs on, which make it scalable and also lightweight in terms of actual code:

AWS Lambda (and Amazon API Gateway) – handle the requests to both store experiment data and to return the experiment results.
Redis – using Sets and HyperLogLog data structures to store the experiment data. It provides an extremely efficient memory footprint and great performance.

For free?

As for the “for free” aspect – it largely depends on your scale, but even pretty busy experiments with lots of data are likely to cost a few cents in AWS costs, and can probably fit on a free plan with one of the many redis hosting providers. Even if you splash out for a paid plan (and if you plan to run this for anything important, you should), we’re talking in the range of $10 per month. And this thing should scale really well. In comparison, the Keen.io “Growth” plan is priced at $125 for 1 million events per month. 1 million events still fit within the free bracket on AWS as far as I can tell. At the “Business” plan for Keen.io we’re looking at $1000 for 15m events per month. The same volume on AWS would cost less than $100 according to my estimate. So we’re looking at 10x cheaper. Granted, Keen.io gives you much more, so we’re not comparing apples to apples here. But for the purpose of having an A/B testing backend – this is just as good in my opinion.

Other components

Besides the backend itself, there are two other components to running an A/B test platform: the A/B test client and the dashboard (to display the results). Products like Optimizely and many others also include a fourth component – a WYSIWYG editor to build your experiments. This is deliberately not part of my solution, which is built with developers in mind. I wrote a little about it before, so I won’t go into this now.

A/B Test client

I wrote Alephbet a while ago to scratch an itch. I’ve been using it in production ever since and am quite happy with it. A small adapter is all it takes to point Alephbet to the new backend (Alephbet comes with default adapters for Google Analytics and Keen.io). ~~I plan to add this adapter to Alephbet to make it even easier to use in the near future.~~ Alephbet includes a built-in tracking adapter for Gimel.

Dashboard

To view (and analyze statistically) the results of the experiments, I’m using a very neat and short snippet hosted on codepen (~20 lines of coffeescript). It uses the Abba javascript library to do the heavy lifting. It’s not super-shiny, but anyone with a bit of HTML/CSS knowledge can build a fancy dashboard if they need it. For my needs, this gives me exactly what I need and nothing more.

Motivation

So now that the high level details are out, let’s go over the motivation for creating the backend in the first place.

As I mentioned, I’m using Alephbet to run my A/B test experiments. I’ve used Google Analytics and Keen.io as backends and both do their job pretty well. But neither of them was perfect.

Google Analytics wins on price (it’s free!), and perhaps on performance and scalability, but loses on technical limitations. First of all, there was a big delay between pushing data and getting it out. Then, getting out the data into some kind of a dashboard was also a pretty big hack (this screencast shows how I’m using tabletop.js + a custom google spreadsheet addon). Another big limitation was the way Google treats unique events. In some cases, unique events were necessary – to avoid duplicates. But in other cases, I wanted to allow duplicates for certain goals. So I had to tweak the dashboard to know which goal was unique and which wasn’t… There was no way to bake this in. The last straw was the fact that Analytics starts sampling your data after a certain volume. I’m not sure I fully grasp how their sampling works and felt like it reduced the accuracy of the results (but had no easy way of quantifying this more precisely).

Keen.io was pretty good actually. I was happy overall. But compared with the price we pay for Analytics (nothing), it felt like paying too much for too little tangible benefit. It’s not a very logical standpoint I suppose. But that’s what it was.

I was looking at several other potential backends I could use for Alephbet, but most of them either didn’t deal well with unique events, or didn’t allow an easy way to aggregate data.

If I’m totally honest, the main reason was not to save a bit of money, or find a better solution. I was just itching to use AWS Lambda / Redis HyperLogLog (the post by Salvatore Sanfilippo, the creator of redis, was inspiring). I was going on holiday and, despite having little time to sit down and code, wanted to tinker with something. Maybe it’s a lousy reason, or a great one. Time will tell. Anecdotally, the few other open source projects I started were also “holiday” projects (Alephbet, Smugline. Giraffe was finished on my honeymoon believe it or not! …)

Architecture

Let’s dig a bit deeper into the solution architecture, and talk about some of the benefits, limitations and potential alternatives.

Client

I suggest looking at Alephbet to get more details, but at a high level, the client runs in the end-user’s browser. It randomly picks a variant and executes a javascript function to ‘activate’ it. When a goal is reached (the user performs a certain action, and this also includes the pseudo-goal of participating in the experiment), an event is sent to the backend. An event typically looks something like “experiment ABC, variant red, user participated”, or “experiment XYZ, variant blue, check out goal reached”.

One important aspect of the client is that it tries to avoid losing events. An example of losing an event would be when a user clicks on a link, and the javascript client doesn’t have a chance to send the event to the backend before the browser window switches to a different page. To fight against this, the client stores events in LocalStorage first (or in a cookie), and then it ‘flushes’ the queue. The client knows that an event was received when the backend returns a 200 OK and can then remove the event from the queue. Whilst this (mostly) solves the lost events issue, it creates a different potential problem: duplicate events. In many cases, the client sends the event to the backend, and the backend receives it, but the browser redirects just before the response arrives, or doesn’t let the callback execute to remove the event from the queue. So as the chances of lost events get much lower, the chances of duplicate events become higher (a classic false positive / false negative trade-off). To protect against this, each event is allocated a random uuid by the client. This uuid is sent with the event to the backend. The backend then needs to be able to de-duplicate those uuids when it counts events.
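The queue-and-retry behaviour can be modelled in a few lines. This is only a toy model written in Python to show the logic, not Alephbet's actual code (the real client is JavaScript); the `EventQueue` and `Backend` classes and the `send` function are made up for this illustration, and a plain dictionary stands in for LocalStorage.

```python
import uuid


class EventQueue:
    """Toy model of the client-side queue (hypothetical, for illustration)."""

    def __init__(self, send):
        self.send = send   # callable(event) -> True once the 200 OK is seen
        self.pending = {}  # uuid -> event; stands in for LocalStorage

    def track(self, experiment, variant, goal):
        event = {
            "uuid": str(uuid.uuid4()),  # random id, generated once per event
            "experiment": experiment,
            "variant": variant,
            "goal": goal,
        }
        self.pending[event["uuid"]] = event  # store first, send later
        return event

    def flush(self):
        # Try to send everything; only drop events that were acknowledged.
        for event_id, event in list(self.pending.items()):
            if self.send(event):
                del self.pending[event_id]


class Backend:
    """Toy backend that counts requests and de-duplicates by uuid."""

    def __init__(self):
        self.requests = 0
        self.seen = set()

    def receive(self, event):
        self.requests += 1
        self.seen.add(event["uuid"])  # a repeated uuid changes nothing


backend = Backend()
lost_responses = [True]  # the first response never reaches the client


def send(event):
    backend.receive(event)
    if lost_responses:
        lost_responses.pop()
        return False  # backend got the event, but the client never found out
    return True


queue = EventQueue(send)
queue.track("ABC", "red", "participated")
queue.flush()  # delivered, but the acknowledgement is lost; event stays queued
queue.flush()  # retried and acknowledged this time
print(backend.requests, len(backend.seen), len(queue.pending))  # 2 1 0
```

The backend sees two requests but only one unique uuid, which is why counting uuids rather than requests makes the retry harmless.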

Data Store – Redis HyperLogLog

I needed a data store to keep a tally of each event that comes into the system. As I mentioned above, being able to count unique events (de-duplication) was important to keep an accurate count. One approach would be to store each event in an entry / database row / document, and then run some kind of a unique count on it. Or we could use a nifty algorithm called HyperLogLog. I won’t go into the details of this algorithm, because many others did this much better than I ever could (and I’m not sure I fully grasp how it works in much detail either). The basic premise however is that you trade a little accuracy for a lot of space. HyperLogLog allows you to count unique items without storing each and every item. The trade-off is that the unique count might not be 100% accurate, but it is pretty close.
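In Redis terms, this boils down to two commands: PFADD to add an item to a HyperLogLog and PFCOUNT to read the approximate unique count. Below is a minimal sketch, assuming `client` is a connection object exposing `pfadd` and `pfcount` the way the redis-py package does. The key layout and function names are assumptions made for this example and not necessarily what Gimel uses.

```python
def counter_key(experiment, variant, goal):
    # Hypothetical key layout, chosen for this example only.
    return "ab:{}:{}:{}".format(experiment, variant, goal)


def record_event(client, experiment, variant, goal, event_uuid):
    """Add the event's uuid to the counter for this experiment/variant/goal.

    Adding the same uuid again leaves the HyperLogLog unchanged, so a
    retried (duplicate) event does not inflate the count.
    """
    return client.pfadd(counter_key(experiment, variant, goal), event_uuid)


def unique_count(client, experiment, variant, goal):
    """Approximate number of distinct uuids recorded for this counter."""
    return client.pfcount(counter_key(experiment, variant, goal))
```

Calling `record_event` twice with the same uuid and then `unique_count` should report one event, not two (subject to the small approximation error of HyperLogLog).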

What I liked about the redis HyperLogLog implementation was the standard error rate of 0.81% – which seems very reasonable to me. Given that we’re talking about many different browsers under different conditions, it’s hard to expect a high degree of accuracy in the first place. I doubt the events reaching the system account for all actual events. Consider for example, a browser plugin or an ad-blocker, or even simple cases of users visiting the website and closing the browser without returning – some events will be lost. So this margin feels very acceptable. I’m far from being an expert on statistics, but since we’re looking for statistical significance anyway, we can probably overcome this margin of error by tweaking our expected p-value. I’d love to hear from someone who understands this better who might be able to clarify or suggest the best approach.
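To put the 0.81% figure in perspective: the standard error of HyperLogLog is roughly 1.04 / sqrt(m), where m is the number of registers, and the Redis implementation uses 16384 registers. The short calculation below is only a back-of-the-envelope sketch; the function names are made up for this example, and the range shown covers one standard error, not a guaranteed bound.

```python
import math

REGISTERS = 16384  # number of registers in a Redis HyperLogLog


def standard_error(registers=REGISTERS):
    """Relative standard error of a HyperLogLog with this many registers."""
    return 1.04 / math.sqrt(registers)


def typical_range(true_count, registers=REGISTERS):
    """Counts one standard error below and above the true count."""
    margin = true_count * standard_error(registers)
    return true_count - margin, true_count + margin


print("{:.4%}".format(standard_error()))  # 0.8125%
low, high = typical_range(10000)
print(round(low), round(high))  # 9919 10081
```

So for a goal with 10,000 unique events, the reported count would typically land within about 81 events of the real number.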

Another advantage of the redis implementation was the optimization for low cardinality (low number of unique elements) – both in terms of memory/space and accuracy as far as I understood. Some experiment goals will have a relatively small number of events after all.

In terms of storage space, redis HyperLogLog uses at most 12k per counter. A typical experiment might have, let’s say 10-20 counters? (5-10 goals x 2 variants). So we’re talking about < 300k of memory for each experiment. That’s virtually nothing. Many redis hosting providers will offer 10MB-30MB for free. Paid plans start at around 100MB or so. This gives us ample space for storing experiment data.
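The arithmetic behind these numbers is easy to check. The sketch below uses the 12k per counter figure as an upper bound and ignores everything else Redis stores (key names, bookkeeping data, general overhead), so treat the results as rough estimates. The function names are made up for this example.

```python
HLL_BYTES = 12 * 1024  # 12k per counter, taken as an upper bound


def experiment_bytes(goals, variants):
    """Memory for one experiment: one counter per goal per variant."""
    return goals * variants * HLL_BYTES


def experiments_that_fit(plan_megabytes, goals, variants):
    """How many such experiments fit in a plan of the given size."""
    return (plan_megabytes * 1024 * 1024) // experiment_bytes(goals, variants)


print(experiment_bytes(10, 2) // 1024)  # 240 (KB for 10 goals x 2 variants)
print(experiments_that_fit(10, 10, 2))  # 42 experiments in 10MB
print(experiments_that_fit(30, 10, 2))  # 128 experiments in 30MB
```

Even the larger experiments from the estimate above (10 goals x 2 variants) fit dozens of times over in a free plan.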

In terms of performance – redis is hard to beat there too.

One alternative I considered was Postgresql (which also has a HyperLogLog extension, but not baked-in, and from what I could tell a little less accurate/optimized). Hosting PG was the main drawback for me, and getting things up and running was a bit more involved than with redis. Keen.io offers running queries like unique count as well, but I couldn’t find many other platforms that offer a similar option (I looked at DynamoDB, Parse, Firebase and a few others). Google Analytics doesn’t really fit well either. It does offer unique vs non-unique counts for events. However, the definition of unique is not event-specific, but rather action-specific. It seems like a subtle difference, but it is very important in practice.

Backend – AWS Lambda / API Gateway

Once I chose redis HyperLogLog as the core data store, all I needed was to build a thin API backend around it. The backend had to take care of a few simple types of requests:

track an event – receive an HTTP request with some json data (experiment name, variant, goal and uuid), and then push it to redis.
extract the counters for a specific experiment, or all experiments into some json that can be presented on the dashboard.
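The two request types above can be sketched as plain functions around a redis connection. This is a hedged illustration, not Gimel's actual code: the key names, function names and JSON shape are assumptions made for this example. It assumes `client` behaves like a redis-py connection created with `decode_responses=True` (so set members come back as strings), and it leaves out the AWS Lambda / API Gateway wiring entirely.

```python
import json


def _counter_key(experiment, variant, goal):
    # Hypothetical key layout, chosen for this example only.
    return "counter:{}:{}:{}".format(experiment, variant, goal)


def track(client, experiment, variant, goal, event_uuid):
    """Record one event. Names go into sets so they can be listed later;
    the uuid goes into a HyperLogLog so duplicates are not counted twice."""
    client.sadd("experiments", experiment)
    client.sadd("variants:" + experiment, variant)
    client.sadd("goals:" + experiment, goal)
    client.pfadd(_counter_key(experiment, variant, goal), event_uuid)


def handle_track(client, body):
    """Handle a track request whose body is a JSON string."""
    event = json.loads(body)
    track(client, event["experiment"], event["variant"],
          event["goal"], event["uuid"])
    return {"status": "ok"}


def experiment_results(client, experiment):
    """Build a JSON-serialisable summary of one experiment's counters."""
    variants = sorted(client.smembers("variants:" + experiment))
    goals = sorted(client.smembers("goals:" + experiment))
    return {
        "experiment": experiment,
        "variants": [
            {
                "name": variant,
                "goals": {
                    goal: client.pfcount(_counter_key(experiment, variant, goal))
                    for goal in goals
                },
            }
            for variant in variants
        ],
    }


def all_results(client):
    """Summaries for every experiment seen so far."""
    return [experiment_results(client, name)
            for name in sorted(client.smembers("experiments"))]
```

The output of `all_results` can be passed to `json.dumps` and returned to the dashboard as-is. Because all state lives in redis, any number of copies of these functions can run side by side.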

Note that given that redis stores our data and takes care of de-duplication for us as well, we can easily spread the backend across more than one server. We could load balance across them, or use DNS round-robin or any other technique to avoid a single point of failure and improve performance. The main consideration would be the latency between our redis instance and the backend.

There are TONS of alternatives here, each with its own pros and cons. I considered a simple PHP script on a shared hosting plan (I personally like webfaction and can recommend it), or using Flask / Sinatra. I was also toying with the idea of playing around with OpenResty (Lua/Nginx), and I’m eager to learn and play around with Elixir / Phoenix as well.

In the end I chose AWS Lambda / API Gateway because it offered a scalable solution out of the box, close proximity and low latency to many redis hosting providers, and (probably the main reason) I wanted to play around with it. The affordable pay-as-you-go pricing was also a big consideration. Zero upfront cost is quite appealing.

The main downside to Lambda / API Gateway for me was that it makes it hard to “package” the backend and offer it to other people. I definitely plan to share the code on github or elsewhere. But the README on how to get things going on AWS is probably going to be 3 times as long as the code itself, and explaining all the moving parts isn’t so easy. Compared to a PHP script or an app for <insert your platform here>, it’s a PITA.

I can automate the deployment process easily enough with AWS CLI or something, but sharing it with others isn’t easy. Especially since it requires a redis server and therefore being able to configure the redis host, port, password etc. AWS Lambda unfortunately doesn’t support environment variables (at the time of writing) or something similar to pull configuration data from.

Given how little the code does, it’s easy to port it to many other platforms however, so there’s fairly little lock-in to AWS Lambda.

Limitations

I touched on most of those already, but it might be a good idea to spell them out clearly. There are several limitations or downsides to using this backend:

Packaging – ~~AWS Lambda doesn’t make it that easy to package this as an open source solution you can easily install and run. If I update the code, it’s difficult to upgrade it.~~ Gimel now comes with a built-in CLI for easy installation (gimel deploy). You can get up and running in 5 minutes.
Integration – I really wanted to make this backend the default for Alephbet, or even encourage people to use it over Google Analytics. But the integration, configuration, setup and maintenance of the whole thing makes it much harder. If you’re already using GA then you can pretty much use Alephbet by including one javascript file. An extra backend, even if hosted for you, still requires a lot of effort to get up and running.
Accuracy – HyperLogLog means your results will have a margin of error (standard error of 0.81% using the redis implementation).

Benefits

Cheap – more or less free for small scale setups, and very competitive for larger ones.
Scalable – aws lambda + redis are a good combo. redis is still a single point of failure, but with a bit of work you can cluster or shard it.
Small footprint – both code and data store are tiny, even for large experiments
Focused – it does one thing, and does it well (I hope). Not a general-purpose analytics solution or a backend for any type of event. It’s focused on A/B tests.
Open Source – combined with Alephbet, it’s available to use, contribute to or modify freely (as in freedom)
Portable – redis is very easy to install, configure and host (and there are many hosting providers to choose from). With a bit of work, the same data can be stored on Postgresql or other data stores that support HyperLogLog. AWS Lambda is much more vendor locked, but the code itself should be very easy to port to any type of web backend.
Quick start – gimel CLI can get all the AWS wiring done for you with a simple gimel deploy command. It will even produce a Javascript snippet that you can copy and paste to run your A/B tests straight away.

Code

The code is available on Github at https://github.com/Alephbet/gimel

6 replies on “a Scaleable A/B testing backend in ~100 lines of code (and for free*)”

Good job, I have a similar architecture in mind to collect analytics events globally. A/B test data is a part of all web events.
Did you think about adding Google BigQuery as a backend? The lambda method could stream events into a BigQuery table, and then the reporting, even realtime, could easily be done by running SQL queries.
Cost would be super low and scalability is native like Lambda.
Also you could store raw data and build analytics reporting per user.
Like you I didn’t find an open source solution, and I think that building an open source Google Analytics Premium-like tracking solution based on Gateway / Lambda / BigQuery could do the job at a fraction of the cost, which is around 150 000$ / year.

This is so cool. Great job.

Are you considering adding a more “slick” dashboard?

Thanks Brandon,

Yes! I definitely do. Unfortunately my front-end design skills are very limited, so hope someone (you?) can help. Meanwhile I’m relying on very basic codepen snippets. It does the trick of displaying results, but it’s far from ideal…

Cheers,
Yoav

Thanks – this is a good read and simple enough for me to follow :P

Given that the data’s size isn’t a driver here, the only thing you want to remember when picking your Redis plan is the number of concurrent connections that your plan provides. Lambda’s nature is such that it can spike quite easily, whereas most of the smaller Redis plans have a quota on connection resources.

Shalom Itamar,

Very good point! The connection limit is definitely something to be aware of and plan to match your scale.

I am actually playing around with a few ideas to mitigate this and improve overall throughput (e.g. Kinesis, SQS – to batch those writes together), as well as using Google BigQuery as an alternative backend…

Ludo / Itamar – please take a look at a follow-up post at http://blog.gingerlime.com/2016/a-scalable-analytics-backend-with-google-bigquery-aws-lambda-and-kinesis/