Parallel processing with PHP and Gearman

Some background first

Often you come to a point where you need to process data in a non-blocking fashion and in multiple processes at once. One solution for this is Gearman, which we also used in one of our recent projects.

This particular project involves a collection of RSS feeds (a few thousand items) that need to be polled continuously according to specific rules. Each time new content is detected in a feed, one or more callback URLs need to be fired. A large part of the RSS items have to be checked every 5 minutes.

The first and fastest solution would be to set up a cron job that runs every minute, scans the items against the time rules and fetches them sequentially (sketched after the list below). While this approach seems like the winner, it has a couple of drawbacks, out of which I'll mention these two:

  1. If one of the feeds runs into network congestion the script will block; a new one will be started a minute later (by the crontab) and you will end up with a lot of zombie PHP processes
  2. Scaling. With a single script that checks one feed at a time, you simply cannot scale.
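
To make the drawbacks concrete, the sequential version boils down to a single loop, roughly like this (a simplified sketch reusing the Feeds helper shown further below; the 'url' key and the fetch logic are assumptions for the example):

// cron-invoked script: one process fetches every due feed, one after another
$feeds = new Feeds();
foreach ($feeds->getAllForUpdate() as $feed_info) {
    // a single slow or stalled feed blocks every feed that comes after it
    $content = file_get_contents($feed_info['url']);
    // ... detect new items and fire the callback URL(s) ...
}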

So, we went to Gearman.

But what is Gearman?

Gearman is a framework that provides parallel data processing and load balancing between processes. It comes with implementations for various programming languages, including PHP. Gearman operates with 3 basic concepts:

  1. The Client – the part where the processing requests are issued – this can be a PHP script for instance
  2. The Job Server – this is where the magic happens: it receives requests from clients and dispatches them to registered workers in a load-balanced fashion
  3. The Worker – the part of the system that actually processes a job – one or more PHP scripts

The main advantage of this architecture is that the components are completely decoupled, which allows you to run each of them on separate machines. You can have multiple clients, multiple worker scripts and, why not, multiple job servers. The diagram below shows how the client, the job server and the worker scripts interact.

Putting it all together

Back to our project. We have the RSS feeds stored in a MySQL table. Each feed has a time rule when it needs to be pulled. A cron runs every 5 minutes and decides what feeds need to be pulled.
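
As an illustration, the selection step can be as simple as a query against that table. The class below is only a sketch: the column names and the PDO wiring are assumptions for the example, not our actual schema.

class Feeds
{
    private $db;

    public function __construct(PDO $db)
    {
        $this->db = $db;
    }

    // returns every feed whose next scheduled check is due
    public function getAllForUpdate()
    {
        $sql = 'SELECT id, url, callback_url
                  FROM feeds
                 WHERE next_check_at <= NOW()';
        return $this->db->query($sql)->fetchAll(PDO::FETCH_ASSOC);
    }
}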

The feeds that need to be fetched are sent one by one to the job server in a non-blocking manner. This way you achieve better throughput than having the cron script check each feed itself. Below is the client part of our Gearman setup:

$feeds = new Feeds();
//retrieve the feeds from the table
$feeds_list = $feeds->getAllForUpdate();

// set-up the client by connecting to a server
$client = new GearmanClient();
$client->addServer(GEARMAN_SERVER, GEARMAN_PORT);
foreach($feeds_list as $feed_info) {
   $client->addTaskBackground('process_feed', serialize($feed_info));
}
$client->runTasks();
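
GEARMAN_SERVER and GEARMAN_PORT here are simply constants holding the job server's host and port (Gearman listens on port 4730 by default). serialize() is used because the workload travels as a plain string. addTaskBackground() only queues a background task, so the client does not wait for any result; the whole batch is submitted to the job server when runTasks() is called.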

The worker script is daemonized with SupervisorD. SupervisorD is a process control system that takes care of maintaining the desired number of processes for a given script, in our case the worker. A good practice is to start as many worker processes as there are processor cores available on the system.

This is how we’ve configured SupervisorD to meet our needs:

[program:feedworker]
process_name=%(process_num)s
command=php worker.php
numprocs=12
directory=/var/local/gearman
autostart=true
autorestart=true
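
Once this file is dropped into Supervisor's configuration include directory, running supervisorctl reread followed by supervisorctl update is typically enough to start the twelve worker processes, and Supervisor restarts any of them that exits.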

The worker script that actually handles the feed processing looks like this:

$worker = new GearmanWorker();
// connect to the job server (defaults to 127.0.0.1:4730 when called without arguments)
$worker->addServer();

// Register the same function we invoked on the client
$worker->addFunction("process_feed", function ($_job) {
    $data = unserialize($_job->workload());
    // delegate to a specialized class to process the feed
    $feed = new FeedProcessor($data);
    $feed->work();
});

// keep an infinite loop so the script won't die
while (true) {
    $worker->work();
}
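
FeedProcessor itself is out of scope for this post, but conceptually it does something along these lines. The sketch below is not the production code: the array keys, property and method names are invented for the example.

class FeedProcessor
{
    private $feed_info;

    public function __construct(array $feed_info)
    {
        $this->feed_info = $feed_info;
    }

    public function work()
    {
        // fetch and parse the feed
        $xml = simplexml_load_file($this->feed_info['url']);
        if ($xml === false) {
            return; // network error or invalid XML: give up, the next run will retry
        }

        foreach ($xml->channel->item as $item) {
            // skip items we have already seen (the lookup is omitted here)
            if ($this->isKnownItem((string) $item->guid)) {
                continue;
            }
            // new content detected: fire the callback URL
            file_get_contents($this->feed_info['callback_url']);
        }
    }

    private function isKnownItem($guid)
    {
        // in reality this checks the database; the sketch treats everything as new
        return false;
    }
}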

The set-up

Our current set-up, which runs on AWS, consists of one machine that runs the client script and the Gearman job server, and another machine that runs the worker scripts. The worker machine is part of a scalable set-up, so the system can automatically spin up more machines when the processing demand increases.

Before adding Gearman to our set-up we were able to process ~52 feeds/minute with a single PHP script. Things changed completely with Gearman, when we reached a processing rate of ~154 feeds/minute. Both the single PHP script and the Gearman worker scripts were tested on the same machine, the only difference being that the latter ran as 10 worker processes instead of a single one.
