Celery vs Kafka


  • Asynchronous Task Queue with Django, Celery and AWS SQS
  • Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?
  • A comparison of data processing frameworks
  • Eliminating Task Processing Outages by Replacing RabbitMQ with Apache Kafka Without Downtime
    Does not address the observed issue where Celery workers stop processing tasks. Does not address the issue with harakiri-induced connection churn.

    Upgrade versions
      • Might improve the issue where RabbitMQ becomes stuck in a degraded state
      • Might improve the issue where Celery workers get stuck
      • Might buy us headroom to implement a longer-term strategy
      • Not guaranteed to fix our observed bugs
      • Will not immediately fix our issues with availability, scalability, observability, and operational efficiency
      • Newer versions of RabbitMQ and Celery required newer versions of Python

    Does not address the issue with harakiri-induced connection churn.

    Custom Kafka solution
      • Kafka can be highly available
      • Kafka is horizontally scalable
      • Improved observability with Kafka as the broker
      • Improved operational efficiency
      • A broker change is straightforward
      • Harakiri connection churn does not significantly degrade Kafka performance
      • Addresses the observed issue where Celery workers stop processing tasks
      • Requires more work to implement than all the other options
      • Despite in-house experience, we had not operated Kafka at scale at DoorDash

    Our strategy for onboarding Kafka

    Given our required system uptime, we devised our onboarding strategy based on the following principles to maximize the reliability benefits in the shortest amount of time.

    This strategy involved three steps:

      • Hitting the ground running: We wanted to leverage the basics of the solution we were building as we were iterating on other parts of it. We liken this strategy to driving a race car while swapping in a new fuel pump.
      • Design choices for a seamless adoption by developers: We wanted to minimize wasted effort on the part of all developers that may have resulted from defining a different interface.
      • Incremental rollout with zero downtime: Instead of a big, flashy release being tested in the wild for the first time, with a higher chance of failures, we focused on shipping smaller independent features that could be individually tested in the wild over a longer period of time.

    Hitting the ground running

    Switching to Kafka represented a major technical change in our stack, but one that was sorely needed. We did not have time to waste, since every week we were losing business due to the instability of our legacy RabbitMQ solution. Our first and foremost priority was to create a minimum viable product (MVP) to bring us interim stability and give us the headroom needed to iterate and prepare for a more comprehensive solution with wider adoption.

    Our MVP consisted of producers that published task Fully Qualified Names (FQNs) and pickled arguments to Kafka, while our consumers read those messages, imported the tasks from the FQN, and executed them synchronously with the specified arguments.

    Design choices for a seamless adoption by developers

    Sometimes, developer adoption is a greater challenge than development. To ease adoption, we kept the existing task-authoring interface, so the same interface could be used to write tasks for both systems.
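    To make the MVP pattern concrete, here is a minimal, hypothetical sketch of the FQN-plus-pickled-arguments approach described above. This is not DoorDash's code; the kafka-python library, the topic name, and the broker address are all assumptions.

```python
# Hypothetical sketch of the FQN + pickled-arguments pattern (not DoorDash's code).
# Assumes the kafka-python package and a broker reachable at localhost:9092.
import importlib
import pickle

from kafka import KafkaConsumer, KafkaProducer

TOPIC = "task-queue"  # topic name is an assumption

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def submit_task(fqn, *args, **kwargs):
    """Publish the task's fully qualified name plus pickled arguments."""
    payload = pickle.dumps({"fqn": fqn, "args": args, "kwargs": kwargs})
    producer.send(TOPIC, payload)

def run_worker():
    """Read messages, import each task from its FQN, and run it synchronously."""
    consumer = KafkaConsumer(TOPIC, bootstrap_servers="localhost:9092",
                             group_id="task-workers")
    for record in consumer:
        message = pickle.loads(record.value)
        module_path, _, func_name = message["fqn"].rpartition(".")
        task = getattr(importlib.import_module(module_path), func_name)
        task(*message["args"], **message["kwargs"])
```

    A design point worth noting in this kind of scheme is that the payload carries only a name and arguments, so workers must run the same codebase in order to import and execute the task.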

    With these decisions in place, engineering teams had to do no additional work to integrate with the new system, barring implementing a single feature flag. We wanted to roll out our system as soon as our MVP was ready, but it did not yet support all the same features as Celery. Celery allows users to configure their tasks with parameters in their task annotation or when they submit their task.

    To allow us to launch more quickly, we created a whitelist of compatible parameters and chose to support the smallest number of features needed to support a majority of tasks.

    Figure 2: We rapidly ramped up task volume to the Kafka-based MVP, starting with low-risk and low-priority tasks first. Some of these were tasks that ran at off-peak hours, which explains the spikes of the metric depicted above.

    We dealt with our primary problem of outages quickly, and over the course of the project supported more and more esoteric features to enable execution of the remaining tasks.

    Incremental rollout, zero downtime

    The ability to switch Kafka clusters and to switch between RabbitMQ and Kafka dynamically, without business impact, was extremely important to us. This ability also helped us in a variety of operations, such as cluster maintenance, load shedding, and gradual migrations.

    To implement this rollout, we used dynamic feature flags both on the message submission side and on the message consumption side. The cost of being fully dynamic here was having to keep our worker fleet running at double capacity.

    Half of this fleet was devoted to RabbitMQ, and the rest to Kafka. Running the worker fleet at double capacity was definitely taxing on our infrastructure. At one point we even spun up a completely new Kubernetes cluster just to house all of our workers. During the initial phase of development, this flexibility served us well. Once we had more confidence in our new system, we looked at ways to reduce the load on our infrastructure, such as running multiple consuming processes per worker machine.
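    As an illustration of how such a dynamic switch can work, here is a hedged sketch of flag-driven routing at the message submission level. The feature-flag client and flag name are hypothetical, submit_task refers to the producer sketched earlier, and this is not the actual DoorDash implementation.

```python
# Hedged sketch of flag-driven routing between the legacy Celery/RabbitMQ path
# and the Kafka path. The flag client and flag name are hypothetical stand-ins.

class FeatureFlags:
    """Hypothetical stand-in for a dynamically updatable feature-flag client."""
    def is_enabled(self, name, key=""):
        return False  # a real client would look the flag up at call time

flags = FeatureFlags()

def dispatch(task, *args, **kwargs):
    """Route a submission to the Kafka path or the legacy Celery/RabbitMQ path."""
    fqn = f"{task.__module__}.{task.__name__}"
    if flags.is_enabled("use-kafka-task-processing", key=fqn):
        submit_task(fqn, *args, **kwargs)           # Kafka producer from the earlier sketch
    else:
        task.apply_async(args=args, kwargs=kwargs)  # Celery's normal submission API
```

    Because the flag is evaluated at submission time, traffic can be shifted per task, gradually, and rolled back instantly without a deploy.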

    As we transitioned various topics over, we were able to start reducing the worker counts for RabbitMQ while maintaining a small reserve capacity.

    No solution is perfect, iterate as needed

    With our MVP in production, we had the headroom needed to iterate on and polish our product. We ranked every missing Celery feature by the number of tasks that used it to help us decide which ones to implement first. Features used by only a few tasks were not implemented in our custom solution.

    Instead, we rewrote those tasks to not use that specific feature. With this strategy, we eventually moved all tasks off Celery.

    Kafka brought its own challenges, most notably head-of-line blocking within a partition: if a message in a single partition takes too long to be processed, it will stall consumption of all messages behind it in that partition, as seen in Figure 3, below.

    This problem can be particularly disastrous in the case of a high-priority topic. We want to be able to continue to process messages in a partition in the event that a delay happens.

    Other partitions would continue to process as expected. While parallelism is, fundamentally, a Python problem, the concepts of this solution are applicable to other languages as well. Our solution, depicted in Figure 4, below, was to house one Kafka-consumer process and multiple task-execution processes per worker.

    The Kafka-consumer process is responsible for fetching messages from Kafka and placing them on a local queue that is read by the task-execution processes. It continues consuming until the local queue hits a user-defined threshold. This solution allows messages in the partition to keep flowing, and only one task-execution process is stalled by the slow message. The threshold also limits the number of in-flight messages in the local queue, which could be lost in the event of a system crash.
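    A hedged sketch of this worker layout follows, assuming kafka-python and the payload format from the earlier sketch. The bounded multiprocessing queue plays the role of the user-defined threshold: putting onto a full queue blocks, which pauses consumption.

```python
# Hedged sketch of the one-consumer / many-executor worker layout; the topic,
# broker address, and sizes are assumptions, not DoorDash's actual settings.
import importlib
import multiprocessing as mp
import pickle

from kafka import KafkaConsumer

LOCAL_QUEUE_SIZE = 32   # user-defined threshold on in-flight messages
NUM_EXECUTORS = 4       # task-execution processes per worker machine

def run_task(payload):
    """Import the task from its FQN and execute it with the stored arguments."""
    module_path, _, func_name = payload["fqn"].rpartition(".")
    task = getattr(importlib.import_module(module_path), func_name)
    task(*payload["args"], **payload["kwargs"])

def consume(queue):
    """Single process that fetches from Kafka and feeds the bounded local queue."""
    consumer = KafkaConsumer("task-queue", bootstrap_servers="localhost:9092",
                             group_id="task-workers")
    for record in consumer:
        queue.put(record.value)   # blocks once the local queue is at capacity

def execute(queue):
    """One of several processes pulling from the local queue; a slow task only
    stalls this process, not the Kafka consumer or the sibling executors."""
    while True:
        run_task(pickle.loads(queue.get()))

if __name__ == "__main__":
    local_queue = mp.Queue(maxsize=LOCAL_QUEUE_SIZE)
    executors = [mp.Process(target=execute, args=(local_queue,), daemon=True)
                 for _ in range(NUM_EXECUTORS)]
    for p in executors:
        p.start()
    consume(local_queue)
```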

    Figure 4: Our non-blocking Kafka worker consists of a local message queue and two types of processes: a Kafka-consumer process and multiple task-executor processes. This diagram shows that a slow-processing message (in red) only blocks a single task-executor until it completes, while other messages behind it in the partition continue to be processed by other task-executors.

    The disruptiveness of deploys

    We deploy our Django app multiple times a day. One drawback we noticed with our solution is that a deployment triggers a rebalance of partition assignments in Kafka.

    Despite using a different consumer group per topic to limit the rebalance scope, deployments still caused a momentary slowdown in message processing as task consumption had to stop during rebalancing.

    Slowdowns may be acceptable for planned releases, but they can be catastrophic when, for example, we are doing an emergency release to hotfix a bug, since the result would be a cascading processing slowdown. Newer versions of Kafka and its clients support incremental cooperative rebalancing, which would massively reduce the operational impact of a rebalance.

    Upgrading our clients to support this type of rebalancing would be our solution of choice going forward. Unfortunately, incremental cooperative rebalancing is not yet supported in our chosen Kafka client.
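    For clients that do support it (for example, confluent-kafka on librdkafka 1.6 or newer), incremental cooperative rebalancing is typically enabled through the consumer's partition assignment strategy. The sketch below is illustrative configuration, not our production setup; the broker address, group, and topic are assumptions.

```python
# Hedged sketch: enabling incremental cooperative rebalancing in a client that
# supports it (confluent-kafka / librdkafka >= 1.6). Values are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "task-workers",
    # Cooperative sticky assignment lets consumers keep most of their partitions
    # during a rebalance instead of the whole group stopping and reassigning.
    "partition.assignment.strategy": "cooperative-sticky",
})
consumer.subscribe(["task-queue"])
```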

    Key wins

    With the conclusion of this project, we realized significant improvements in terms of uptime, scalability, observability, and decentralization. These wins were crucial to ensuring the continued growth of our business.

    No more repeated outages

    We stopped the repeated outages almost as soon as we started rolling out this custom Kafka approach. Outages had been resulting in extremely poor user experiences.

    By implementing only a small subset of the most used Celery features in our MVP, we were able to ship working code to production in two weeks. With the MVP in place, we were able to significantly reduce the load on RabbitMQ and Celery as we continued to harden our solution and implement new features.

    Task processing was no longer the limiting factor for growth

    With Kafka at the heart of our architecture, we built a task processing system that is highly available and horizontally scalable, allowing DoorDash and its customers to continue their growth.

    Massively augmented observability

    Since this was a custom solution, we were able to bake in more metrics at almost every level. Each queue, worker, and task was fully observable at a very granular level in production and development environments.

    This increased observability was a huge win not only in a production sense but also in terms of developer productivity.

    Operational decentralization

    With the observability improvements, we were able to templatize our alerts as Terraform modules and explicitly assign owners to every single topic and, implicitly, all of its tasks.

    A detailed operating guide for the task processing system makes information accessible for all engineers to debug operational issues with their topics and workers as well as perform overall Kafka cluster-management operations, as needed.

    Day-to-day operations are self-serve, and support from our Infrastructure team is rarely needed.

    Conclusion

    To summarize, we hit the ceiling of our ability to scale RabbitMQ and had to look for alternatives. The alternative we went with was a custom Kafka-based solution.

    While there are some drawbacks to using Kafka, we found a number of workarounds, described above. When critical workflows heavily rely on asynchronous task processing, ensuring scalability is of the utmost importance. This strategy, in the general case, is a tactical approach to quickly mitigate reliability issues and buy sorely needed time for a more robust and strategic solution.


    When using asynchronous communication for Microservices, it is common to use a message broker. There are a few message brokers you can choose from, varying in scale and data capabilities.

    Microservices Communication: Synchronous and Asynchronous

    There are two common ways Microservices communicate with each other: Synchronous and Asynchronous. In Synchronous communication, the caller waits for a response before proceeding. In Asynchronous communication, by contrast, messages are sent without waiting for a response.

    This is suited for distributed systems, and usually requires a message broker to manage the messages. The type of communication you choose should take into account different parameters, such as how you structure your Microservices, what infrastructure you have in place, latency, scale, dependencies, and the purpose of the communication. Asynchronous communication may be more complicated to establish and requires adding more components to the stack, but the advantages of using Asynchronous communication for Microservices outweigh the cons.

    Asynchronous Communication Advantages

    First and foremost, asynchronous communication is non-blocking by definition. It also supports better scaling than Synchronous operations. Third, in the event a Microservice crashes, Asynchronous communication mechanisms provide various recovery techniques and are generally better at handling errors pertaining to the crash.

    A new service can even be introduced after an old one has been running for a long time. Finally, when choosing Asynchronous operations, you increase your capability of creating a central discovery, monitoring, load balancing, or even policy enforcer in the future.

    This will give you more flexibility, scalability, and capabilities in your code and system building.

    Choosing the Right Message Broker

    Asynchronous communication is usually managed through a message broker. When choosing a broker for executing your asynchronous operations, you should consider a few things:

      • Broker Scale — the number of messages sent per second in the system.
      • Data Persistency — the ability to recover messages.
      • One-to-One vs. One-to-Many consumers — which consumption patterns the broker supports.

    We checked out the latest and greatest services out there in order to find out which provider is the strongest within these three categories.

    Comparing Different Message Brokers

    RabbitMQ

      • Scale: based on configuration and resources, the ballpark here is around 50K messages per second.
      • Persistency: both persistent and transient messages are supported.
      • One-to-one vs one-to-many consumers: both.

    RabbitMQ, released in 2007, is one of the first common message brokers to be created. It supports all major languages, including Python, Java, .NET, and more. Expect some performance issues when in persistent mode.

    Kafka

      • Scale: can send up to a million messages per second.
      • Persistency: yes.
      • One-to-one vs one-to-many consumers: only one-to-many (seems strange at first glance, right?!).

    Kafka was created by LinkedIn in 2011 to handle high-throughput, low-latency processing. As a distributed streaming platform, Kafka replicates a publish-subscribe service. It provides data persistency and stores streams of records, which renders it capable of exchanging quality messages. Kafka's creators, who later founded Confluent, remain the main contributors to the Kafka project.

    Redis

      • Scale: can send up to a million messages per second.

    Redis is a bit different from the other message brokers. At its core, Redis is an in-memory data store that can be used as either a high-performance key-value store or as a message broker. Originally, Redis did not support both one-to-one and one-to-many consumption; however, since the introduction of Redis Streams in Redis 5.0, it does. All three are beasts in their category, but as described, they operate quite differently.

    Here is our recommendation for the right message broker to use according to different use cases. With the release of Redis Streams in 5.0, Redis is also a candidate for one-to-many use cases. Kafka is ideal for one-to-many use cases where persistency is required.

    Complex Routing: RabbitMQ

    RabbitMQ is an older, yet mature broker with a lot of features and capabilities that support complex routing.

    Consider Your Software Stack

    The final consideration, of course, is your current software stack.

    We at Otonomo have used all of the above through our platform evolution and growth, and then some!


    Asynchronous Task Queue with Django, Celery and AWS SQS


    Components description: message broker

    First things first: what is a message broker? Our always friendly Wikipedia says that it "mediates communication among applications, minimizing the mutual awareness that applications should have of each other in order to be able to exchange messages, effectively implementing decoupling."

    Besides that, you do have the option of deploying your message broker on EC2 instances, which could be cheaper depending on the number of messages. But you should also consider the time and effort required to set it up on EC2 and the extra effort in case you decide to use it with Auto Scaling.

    What I see as the main benefit of using SQS is that it is fully managed by AWS and you pay on demand according to your usage; if your application or feature requirements fit the SQS model, it is very straightforward to configure and use.

    Celery

    What is Celery?


    Celery is an asynchronous task queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. It lets you move specific code execution outside of the HTTP request-response cycle.

    This way, your server can respond as quickly as possible for a specific request, as it spawns an asynchronous job as a worker to execute the specific piece of code, which improves your server response time. Also, with Celery you have the option of running jobs in the background on a regular schedule.
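    To make this concrete, here is a minimal, hypothetical sketch of a Celery app configured with SQS as the broker. It assumes the celery package with its SQS transport (kombu[sqs]) is installed and that AWS credentials and region are available in the environment; the task body is a placeholder.

```python
# Minimal, hypothetical sketch of Celery with SQS as the broker.
from celery import Celery

app = Celery("tasks", broker="sqs://")  # SQS transport; credentials come from the environment

@app.task
def send_welcome_email(user_id):
    # Placeholder body: the real task would render and send the email.
    print(f"sending welcome email to user {user_id}")

# From a Django view, schedule the work outside the request/response cycle:
# send_welcome_email.delay(42)
```

    Calling send_welcome_email.delay(42) returns immediately, while a worker started with `celery -A tasks worker` picks the job up from the SQS queue in the background.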

    A comparison of data processing frameworks

    Putting things together

    To get everybody on the same page: AWS SQS is the message broker chosen here, and Celery is the component responsible for orchestrating — consuming, reading, and writing — the message queue. Since Celery is a library, it needs to be set up on top of Django.

    In practice, these message inboxes are like task queues. Another expectation for these systems is that tasks be independent from one another. This means that they can be, and almost always are, processed in parallel by multiple identical consumers, usually referred to as workers.

    This property also enables independent failure, which is a good feature for many workloads. As an example, being unable to process a payment from one user (maybe because of missing profile information or other trivial problems) would not stop the whole payment processing pipeline for all users.

    It is common practice to use RabbitMQ through frameworks that offer an easy way to implement various retry policies. The simpler version of this pattern (task queues) can also be implemented using Redis Lists directly. Redis has blocking and atomic operations that make building bespoke solutions very easy.
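    As a small illustration of that point, here is a hedged sketch of a bespoke task queue built directly on Redis Lists with redis-py; the queue name and payload format are made up for the example.

```python
# Hedged sketch of a bespoke task queue on Redis Lists (assumes redis-py and a
# local Redis instance). LPUSH/BRPOP give atomic hand-off and blocking pop.
import json
import redis

r = redis.Redis()

def enqueue(task_name, **kwargs):
    """Push a task description onto the shared list."""
    r.lpush("tasks", json.dumps({"task": task_name, "kwargs": kwargs}))

def worker_loop():
    """Block until a task is available, then process it."""
    while True:
        _key, raw = r.brpop("tasks")   # blocks until an item arrives
        job = json.loads(raw)
        print("processing", job["task"], job["kwargs"])
```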

    A special mention goes to Kue, which uses Redis in a nifty implementation of task queues for JavaScript.

    Streams are an immutable, append-only series of time-directed entries, and many types of data fit naturally into that format. Such entries also tend to have a fairly regular structure, since they keep the same set of fields, a property that streams can exploit for better space efficiency.


    This type of data fits well in a stream because the most direct way of accessing the data is by retrieving a given time range, which streams can do in a very efficient way. Back to our communication use case, all streams implementations also allow clients to tail a stream, receiving live updates as new entries get added.

    This is sometimes called observing or subscribing to the stream. It is also possible — and sometimes preferable — to implement service-to-service communication over streams, entering the realm of streaming architectures.
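    As an illustration of tailing, here is a hedged sketch using Redis Streams via redis-py; the stream and field names are invented for the example.

```python
# Hedged sketch of appending to and tailing a stream with Redis Streams
# (assumes redis-py and a local Redis 5.0+ instance).
import redis

r = redis.Redis()

# Producer side: append an entry; Redis assigns a time-ordered ID.
r.xadd("orders", {"user_id": "42", "total": "19.90"})

# Consumer side: tail the stream, blocking up to 5s for entries newer than last_id.
last_id = "$"  # "$" means "only entries added after we start reading"
while True:
    response = r.xread({"orders": last_id}, block=5000, count=10)
    for _stream, entries in response:
        for entry_id, fields in entries:
            print(entry_id, fields)
            last_id = entry_id
```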

    There are many subtle implications from this change in design. As an example, you can add new services later and have them go through the whole stream history. In queues, this is not possible, because tasks get deleted once completed and the way communication is generally expressed in those systems does not allow for it (think imperative commands versus a replayable log of events).

    Streams have a dual nature: data structure and communication pattern. Some data fits into this naturally. The practice of fully embracing this dual nature is called event sourcing. With event sourcing, you define your business models as an endless stream of events and let the business logic and other services react to it.
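    As a toy illustration of event sourcing: the event types and the account example below are invented, and a real system would persist the events in a stream rather than an in-memory list, but the core idea is the same — state is derived by replaying events.

```python
# Toy event-sourcing illustration: current state is a fold over an append-only
# history of events (names and the account example are made up).
from dataclasses import dataclass

@dataclass
class Deposited:
    amount: int

@dataclass
class Withdrawn:
    amount: int

events = [Deposited(100), Withdrawn(30), Deposited(5)]  # the endless stream, in miniature

def balance(history):
    """Replay the event stream to compute the account's current balance."""
    total = 0
    for event in history:
        if isinstance(event, Deposited):
            total += event.amount
        elif isinstance(event, Withdrawn):
            total -= event.amount
    return total

print(balance(events))  # 75
```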

