Concurrency Master - davidvgalbraith

Lest you think I’m only good for one-line bug fixes in large existing projects, today I’ll share the story of a module I wrote all by myself: the Concurrency Master.

Writing to Elasticsearch

I was working on the Juttle Elastic Adapter’s write processor. The inputs to write are arrays of JavaScript objects, and its job is to store all those objects as documents in Elasticsearch. In version 0.1, write just sent a bulk request every time it received some points. This simple implementation worked when only a few points were at stake, but Team Juttle’s venerable QA team soon tried to stream hundreds of thousands of points from a file to Elasticsearch via write.

With so many points, write created hundreds of HTTP requests and sent them all in parallel. This overwhelmed Elasticsearch’s ability to process requests, and about 30% of the points were dropped due to request timeouts or task queue overflows in Elasticsearch. We needed a better strategy.

Callbacks

The key to Node.js’s rockstar performance is its support for asynchronous programming. Asynchronous programming enables a Node.js program to make a request to another process and perform other tasks while that process handles the request. Then, when the external process is finished, Node.js can pick up where it left off and handle the response. Without asynchronous programming, the Node.js program would have to wait until the external process finishes before doing anything else.

Node.js’s built-in implementation of asynchronous programming revolves around callbacks. A callback is a function that Node.js calls in response to an event, such as a request completing. Here’s a simple example that gets the home page of my blog:

var request = require('request');

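request('http://www.davidvgalbraith.com', function callback(err, response) {
    if (err) {
        // the request never completed, e.g. because of a network failure
        console.error('request failed: ' + err.message);
        return;
    }
    console.log('response received!');
});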

console.log('request sent!');

When request completes, callback is called. In the meantime, the Node.js program is free to perform other tasks, like logging request sent!. A second or so passes between request sent! and response received!: almost all that time is spent waiting for the request and response to travel through the network between your computer and the server hosting this blog. A second is a long time for a computer, so it’s good to keep the program free during that time.

Promises

Callbacks are cool, but they don’t provide all the functionality that one could ever want. That’s why Promises were invented. A Promise is an object that wraps an asynchronous operation. The Promise has methods to control the operation. The simplest is .then, which schedules a callback to execute when the operation finishes. There’s also .catch, which handles any errors that happen during the operation; in libraries such as Bluebird, .cancel, which gives up on it altogether; and a host of others.

Furthermore, Promise libraries such as Bluebird offer utilities for handling multiple Promises at once, such as Promise.map, which calls a Promise-returning function on each element in an array, with a concurrency parameter specifying how many of the operations can run in parallel. Here it is in action:

var request = require('request');
var Promise = require('bluebird');

var promisified_request = Promise.promisify(request);

Promise.map(['http://www.davidvgalbraith.com', 'http://www.google.com', 'http://www.twitter.com'], function(site) {
    console.log('requesting ' + site);
    return promisified_request(site)
        .then(function(result) {
            console.log(site + ' request done');
        });
}, {concurrency: 1});

promisify is a magic Bluebird function that takes a callback-taking function like request and turns it into a Promise-returning function. By using Promise.map with concurrency 1, this script runs a request for davidvgalbraith.com, google.com and twitter.com, one at a time. Cool!

The Fix

I needed a way to control the number of concurrent requests that write makes. My first thought was to use Promise.map. Upon further reflection, though, Promise.map was unsuitable: points come into write at irregular intervals, whereas Promise.map has to know up front all the objects in the array it is mapping over.

For instance, if write receives a batch of 10,000 points, I could use Promise.map to split them into ten requests of size 1,000 and run, say, only 3 of them in parallel. But if 10,000 more points showed up while I was still sending the first 10,000, there would be no way to add requests to the Promise.map handling the original points. Promise.map couldn’t manage the total concurrency of write’s requests to Elasticsearch.
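To make the batching half of that example concrete, here is a minimal sketch of splitting 10,000 points into ten batches of 1,000. The chunk helper and the send_batch name in the comment are made up for illustration; they are not part of the adapter.

```javascript
// Hypothetical helper: split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
    var batches = [];
    for (var i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Build 10,000 placeholder points.
var points = [];
for (var i = 0; i < 10000; i++) {
    points.push({value: i});
}

var batches = chunk(points, 1000);

// With Bluebird, the batches could then be sent three at a time:
//     Promise.map(batches, send_batch, {concurrency: 3});
// where send_batch is an assumed Promise-returning function.
// The batches array is fixed at that moment, so points that arrive
// later cannot join this particular Promise.map call.
```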

With some meditation, I came up with an elegant scheme for managing total request concurrency. It looked something like this:

class WriteElastic {
    constructor(options) {
        this.concurrency = options.concurrency; // write() needs this to wrap write_index around
        this.write_index = 0;
        this.writes = [];
        for (var i = 0; i < options.concurrency; i++) {
            this.writes.push(Promise.resolve());
        }
    }

    write(points) {
        this.writes[this.write_index] = this.writes[this.write_index].then(() => {
            return this._write(points); // the _write method handles the low-level Elasticsearch request logic
        });
        this.write_index = (this.write_index + 1) % this.concurrency;
    }
}

In this code, the adapter keeps a fixed-size array with one element per allowed concurrent request. Each element is a Promise representing the most recent request queued on that slot. When new points come in, the adapter picks the next slot in round-robin order and chains a request to insert the new points after that slot’s Promise. So a new set of points doesn’t start inserting until the previous set on the same slot finishes. That’s how we enforce the concurrency requirement despite being unable to predict when points will come in.
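The chaining is easier to see with a toy trace. This is only a sketch under stated assumptions: fake_write is a hypothetical stand-in for the real _write that just waits on a timer, and the slot logic is copied out of the class into free functions.

```javascript
var events = [];

// Hypothetical stand-in for _write: logs its start and end, and takes `ms` milliseconds.
function fake_write(name, ms) {
    events.push('start ' + name);
    return new Promise(function(resolve) {
        setTimeout(function() {
            events.push('end ' + name);
            resolve();
        }, ms);
    });
}

var concurrency = 2;
var slots = [Promise.resolve(), Promise.resolve()];
var slot_index = 0;

function write(name, ms) {
    // Chain the new write after whatever is already queued on this slot.
    slots[slot_index] = slots[slot_index].then(function() {
        return fake_write(name, ms);
    });
    slot_index = (slot_index + 1) % concurrency;
}

write('a', 30); // lands on slot 0
write('b', 10); // lands on slot 1
write('c', 10); // lands on slot 0 again, so it queues behind 'a'

// Resolves with the full event log once both slots have drained.
var done = Promise.all(slots).then(function() {
    return events;
});
```

The recorded order is start a, start b, end b, end a, start c, end c. At most two writes are ever in flight, and because slots are assigned round-robin, c waits for a even though slot 1 became idle earlier.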

Encapsulation: enter the Concurrency Master

Reading the code, it occurred to me that the logic for maintaining these concurrent requests was a bit of complexity that didn’t really belong in the Juttle Elastic Adapter. The Juttle Elastic Adapter just wants to write points to Elasticsearch, and the details of holding this array of Promises and figuring out how to chain them to manage concurrency are more than it bargained for. Furthermore, other adapters that write to databases would likely need similar logic. So I decided to move this code into its own class, and that’s how the ConcurrencyMaster was born. Here’s the Concurrency Master, in all its glory:

class ConcurrencyMaster {
    constructor(concurrency) {
        this.promises = [];
        for (var i = 0; i < concurrency; i++) {
            this.promises.push(Promise.resolve());
        }
        this.promise_index = 0;
        this.concurrency = concurrency;
    }

    add(promise_func) {
        this.promises[this.promise_index] = this.promises[this.promise_index].then(promise_func);
        this.promise_index = (this.promise_index + 1) % this.concurrency;
    }

    wait() {
        return Promise.all(this.promises);
    }
}

The constructor takes a number argument specifying the desired concurrency and sets up its Promise array with this length. Then the add method just takes a Promise-returning function and chains a call to it after one of the array’s Promises. Finally, wait returns a Promise that resolves when every function passed to add so far has been called and the Promise it returned has resolved. Here’s write, rewritten to use the Concurrency Master:
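Before that, a quick standalone check of the class may help. In this sketch, fake_request is a hypothetical task that only waits on a timer, and the in_flight counters exist purely to measure how many tasks overlap.

```javascript
class ConcurrencyMaster {
    constructor(concurrency) {
        this.promises = [];
        for (var i = 0; i < concurrency; i++) {
            this.promises.push(Promise.resolve());
        }
        this.promise_index = 0;
        this.concurrency = concurrency;
    }

    add(promise_func) {
        this.promises[this.promise_index] = this.promises[this.promise_index].then(promise_func);
        this.promise_index = (this.promise_index + 1) % this.concurrency;
    }

    wait() {
        return Promise.all(this.promises);
    }
}

var in_flight = 0;
var max_in_flight = 0;

// Hypothetical task: pretends to be a request that takes 5 ms.
function fake_request() {
    in_flight++;
    max_in_flight = Math.max(max_in_flight, in_flight);
    return new Promise(function(resolve) {
        setTimeout(function() {
            in_flight--;
            resolve();
        }, 5);
    });
}

var master = new ConcurrencyMaster(3);
for (var i = 0; i < 10; i++) {
    master.add(fake_request);
}

// Resolves with the peak number of overlapping tasks once all ten have finished.
var finished = master.wait().then(function() {
    return max_in_flight;
});
```

Ten tasks go in, and the peak overlap comes out as 3, matching the constructor argument. One thing to keep in mind with this design: .then(promise_func) does not call promise_func if the Promise before it rejected, so a failed task causes later tasks on the same slot to be skipped unless the task handles its own errors.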

class WriteElastic {
    constructor(options) {
        this.concurrency_master = new ConcurrencyMaster(options.concurrency);
    }

    write(points) {
        var self = this;
        var execute_write = function() {
            return self._write(points);
        };
        this.concurrency_master.add(execute_write);
    }
}

We wrap the call to self._write in the function execute_write because the act of calling self._write sends a request to Elasticsearch. The Concurrency Master has to be the one who decides when to make that call.
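The difference between calling a function and handing over a function to be called later can be shown in a few lines. Here fake_write is a hypothetical stand-in for self._write that just records when it gets invoked.

```javascript
var calls = [];

// Hypothetical stand-in for self._write: records the moment it is invoked.
function fake_write(name) {
    calls.push(name);
    return Promise.resolve(name);
}

// Eager: fake_write runs right here, before any scheduler gets a say.
var eager = fake_write('eager');

// Lazy: nothing runs until someone calls the wrapper.
var lazy = function() {
    return fake_write('lazy');
};

// Only the eager call has happened so far.
var calls_before = calls.slice();

// Invoking the wrapper is what finally triggers the second call.
var result = lazy().then(function() {
    return calls.slice();
});
```

Passing eager to a scheduler would be too late, since the work has already started. Passing lazy leaves the timing up to whoever holds it, which is the role the Concurrency Master plays.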

With this approach, the Juttle Elastic Adapter doesn’t need to know anything about concurrency management, just that the Concurrency Master will take care of it. The Juttle Elastic Adapter can keep its focus on talking with Elasticsearch, while the Concurrency Master handles scheduling requests to keep the pace reasonable. We have two simple classes that each do one thing well, and we have a handy reusable utility to help any other code that needs to manage concurrency. Software, engineered!
