Overview and features

Introduction

Py-zmq-pipeline is a high-level pipeline pattern implemented in Python_ with ZeroMQ_.

The ZeroMQ Pipeline pattern is a data and task distribution pattern for distributed and parallel computing. The py-zmq-pipeline implementation abstracts away many details of ZeroMQ while keeping all the performance advantages the framework offers.

This library focuses more on task distribution than data distribution, since distributing data creates a memory bottleneck and tends not to scale well.

Client workers are responsible for both retrieval and processing of data in this implementation.
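To illustrate the difference between distributing tasks and distributing data, here is a minimal sketch in which each message carries only a small reference and the worker retrieves the data itself. Everything in it is an assumption made for illustration: `DATA_STORE`, `make_task` and `work` are hypothetical names, not part of py-zmq-pipeline, and `json` stands in for the library's own serialization so that the example needs only the standard library.

```python
import json

# Stand-in for a shared data store (database, object store, network file system).
DATA_STORE = {
    "chunk-0": [1, 2, 3],
    "chunk-1": [10, 20, 30],
}

def make_task(key):
    """Server side: the task holds only a small reference to the data."""
    return json.dumps({"key": key}).encode()

def work(message):
    """Worker side: retrieve the referenced data, then process it."""
    task = json.loads(message.decode())
    values = DATA_STORE[task["key"]]  # retrieval happens on the worker
    return sum(values)                # processing happens on the worker

messages = [make_task(key) for key in sorted(DATA_STORE)]
print([work(m) for m in messages])  # [6, 60]
```

The messages stay a few bytes long regardless of how large the referenced data is, which is why this style tends to scale better than shipping the data through the distributor.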

Features

A good pipeline pattern implementation uses load balancing to ensure all workers are supplied with roughly the same amount of work. While ZeroMQ provides fair-queuing and uniform request distribution out of the box, it does not automatically provide load balancing. Py-zmq-pipeline has load balancing built in.
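The difference between uniform distribution and load balancing shows up when tasks vary in cost. The following simulation is a sketch under simplifying assumptions, not a description of how py-zmq-pipeline implements load balancing: `round_robin` hands out tasks in a fixed rotation, while `load_balanced` gives each task to whichever worker becomes free first. Both function names are hypothetical, and task costs are plain integers standing in for processing time.

```python
import heapq

def round_robin(task_costs, n_workers):
    """Assign tasks in a fixed rotation, ignoring how busy each worker is."""
    busy = [0] * n_workers
    for i, cost in enumerate(task_costs):
        busy[i % n_workers] += cost
    return max(busy)  # time at which the slowest worker finishes

def load_balanced(task_costs, n_workers):
    """Give each task to whichever worker becomes free first."""
    free_at = [(0, w) for w in range(n_workers)]  # (time free, worker id)
    heapq.heapify(free_at)
    for cost in task_costs:
        t, w = heapq.heappop(free_at)
        heapq.heappush(free_at, (t + cost, w))
    return max(t for t, _ in free_at)

costs = [8, 1, 8, 1, 8, 1, 8, 1]  # alternating expensive and cheap tasks
print(round_robin(costs, 2))    # 32
print(load_balanced(costs, 2))  # 18
```

With a fixed rotation, one worker receives every expensive task and the other sits idle for most of the run; assigning work to the next free worker finishes the same batch in roughly half the time.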

Additional features include:

  • Built-in single-threaded and multi-threaded workers
  • Task dependencies
    • Allows you to construct dependency trees of tasks
  • Templated design pattern
    • You inherit from a base class, provide your own implementation, and invoke a run() method
  • High performance with low overhead (benchmarks available in examples)
  • Reliability
    • Workers can die and be restarted without interrupting the overall workflow
    • Repeatedly sent tasks are not re-processed once acknowledged
  • Fast serialization
    • Uses msgpack for highly efficient, dense and flexible packing of information, allowing py-zmq-pipeline to have a minimal footprint on the wire
  • Load balancing
  • Built-in logging support for easier debugging
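The templated design mentioned in the list above can be sketched as follows. This is a generic illustration of the template method pattern, not the library's actual API: `BaseWorker`, `SquareWorker` and `process` are hypothetical names, and only the idea of inheriting a class, supplying an implementation and calling `run()` comes from the description above.

```python
from abc import ABC, abstractmethod

class BaseWorker(ABC):
    """Hypothetical base class: owns the loop, delegates the work."""

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.results = []

    @abstractmethod
    def process(self, task):
        """Subclasses provide the task-specific work here."""

    def run(self):
        # Fixed skeleton: take each task, call the user code, record the result.
        for task in self.tasks:
            self.results.append(self.process(task))
        return self.results

class SquareWorker(BaseWorker):
    """User code: only the processing step is written by hand."""

    def process(self, task):
        return task * task

worker = SquareWorker([1, 2, 3])
print(worker.run())  # [1, 4, 9]
```

The base class keeps the control flow in one place, so user code only has to describe what happens to a single task.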

Overview

There are four components in py-zmq-pipeline:

  • A Distributor class, responsible for pushing registered tasks to worker clients
  • A Task class, which encapsulates the work that needs to be done and configures the distributor to do it
  • A Worker class, which consumes computational resources to execute a given task instance
  • A Collector class, a sink that accepts receipts of completed work and sends ACKs (acknowledgements) back to the distributor

Under the server/client paradigm, the distributor, task and collector are server-side entities, while the worker is a client-side entity.
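The way these components fit together can be sketched with standard-library queues and threads in place of ZeroMQ sockets. This is a simplified model under stated assumptions, not the library's implementation: the function names are hypothetical, a task is reduced to an `(id, payload)` tuple, and the collector records acknowledgements in a shared dictionary rather than sending them back to the distributor over the network.

```python
import queue
import threading

def distributor(tasks, task_q, n_workers):
    """Push every task, then one stop sentinel per worker."""
    for task in tasks:
        task_q.put(task)
    for _ in range(n_workers):
        task_q.put(None)

def worker(task_q, result_q):
    """Pull tasks until the sentinel arrives; send each result to the collector."""
    while True:
        task = task_q.get()
        if task is None:
            break
        task_id, payload = task
        result_q.put((task_id, payload * 2))  # the "work" is doubling the payload

def collector(result_q, expected, acked):
    """Accept receipts of completed work and record an acknowledgement for each."""
    for _ in range(expected):
        task_id, result = result_q.get()
        acked[task_id] = result

tasks = [(i, i) for i in range(5)]
task_q, result_q, acked = queue.Queue(), queue.Queue(), {}
n_workers = 2

threads = [threading.Thread(target=worker, args=(task_q, result_q))
           for _ in range(n_workers)]
threads.append(threading.Thread(target=collector,
                                args=(result_q, len(tasks), acked)))
for t in threads:
    t.start()
distributor(tasks, task_q, n_workers)
for t in threads:
    t.join()

print(sorted(acked.items()))  # [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
```

Because the workers pull from a shared queue, a worker that finishes early simply takes the next task, which is the same intuition behind load balancing in the real pipeline.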