I wanted to upgrade my web application by deploying it on AWS ECS (instead of a single server) with the following four services running:
- My Django Application (autoscalable)
- Celery (autoscalable)
- Celery-Beat
- RabbitMQ
The first three share the same code base, which I packaged into a Docker image. RabbitMQ is a custom build on top of the official RabbitMQ image:
RabbitMQ Image

```dockerfile
FROM rabbitmq:3.13

# Install procps so the sysctl binary is available
RUN apt-get update && apt-get install -y procps

# Copy the TCP keepalive configuration file
COPY ./tcp_keepalive.conf /etc/sysctl.d/tcp_keepalive.conf

# Apply the TCP keepalive configuration. Note that this only affects the
# build container: kernel parameters are not baked into the image, so they
# have to be applied again when the container actually starts.
RUN sysctl -p /etc/sysctl.d/tcp_keepalive.conf

# Copy general configuration
COPY ./rabbitmq.conf /etc/rabbitmq/rabbitmq.conf
```
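Since sysctl settings applied during `docker build` don't persist into running containers, on ECS they would typically be declared per container in the task definition via `systemControls` (a real ECS task-definition field). A sketch of the relevant fragment, mirroring the keepalive values from `tcp_keepalive.conf`; whether each `net.ipv4.*` key is settable depends on the task's network mode:

```json
{
  "containerDefinitions": [
    {
      "name": "rabbitmq",
      "systemControls": [
        { "namespace": "net.ipv4.tcp_keepalive_time", "value": "30" },
        { "namespace": "net.ipv4.tcp_keepalive_intvl", "value": "10" },
        { "namespace": "net.ipv4.tcp_keepalive_probes", "value": "4" }
      ]
    }
  ]
}
```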
where `tcp_keepalive.conf` is:

```
net.ipv4.tcp_fin_timeout=30
net.ipv4.tcp_keepalive_time=30
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=4
net.ipv4.tcp_tw_reuse=1
```
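As a sanity check on these values: the kernel probes an idle connection after `tcp_keepalive_time` seconds, then every `tcp_keepalive_intvl` seconds, and declares the peer dead after `tcp_keepalive_probes` unanswered probes, so the worst-case detection time works out to:

```python
# Worst-case time (seconds) until the kernel notices a dead, idle peer
tcp_keepalive_time = 30    # idle time before the first probe
tcp_keepalive_intvl = 10   # gap between subsequent probes
tcp_keepalive_probes = 4   # unanswered probes before giving up

detection_time = tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
print(detection_time)  # 70 -> a vanished peer is detected after ~70 s
```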
and `rabbitmq.conf`:

```ini
# consumer timeout
consumer_timeout = 432000000

# TCP keepalive
tcp_listen_options.keepalive = true
tcp_listen_options.backlog = 128
tcp_listen_options.nodelay = true
tcp_listen_options.exit_on_close = true

# Heartbeat
heartbeat = 60
```
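Two details worth double-checking here: `consumer_timeout` is configured in milliseconds, and, as far as I understand RabbitMQ's heartbeat negotiation, the lower of the server value and the client-requested value wins, so the `CELERY_BROKER_HEARTBEAT = 10` in the settings further down would override `heartbeat = 60` for Celery connections. The timeout value converts as:

```python
# consumer_timeout is given in milliseconds
consumer_timeout_ms = 432_000_000

hours = consumer_timeout_ms / 1000 / 3600
days = hours / 24
print(hours, days)  # 120.0 5.0 -> unacked deliveries time out after five days
```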
I have the following Celery-specific settings in my Django settings file:
```python
import os

# CELERY
CELERY_TIMEZONE = "Europe/Zurich"
CELERY_TASK_ACKS_LATE = True
CELERY_BROKER_POOL_LIMIT = 3  # might be decreased to 1?
CELERY_BROKER_CONNECTION_TIMEOUT = 30
CELERY_BROKER_CONNECTION_RETRY = True
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
CELERY_EVENT_QUEUE_EXPIRES = 60
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_WORKER_CONCURRENCY = 12
CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS = True
CELERY_BROKER_HEARTBEAT = 10
CELERY_BROKER_HEARTBEAT_CHECKRATE = 2
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'confirm_publish': True,
    'max_retries': 3,
    'interval_start': 0,
    'interval_step': 0.2,
    'interval_max': 0.5,
}

celery_default_user = os.environ.get('RABBITMQ_DEFAULT_USER', None)
celery_default_pass = os.environ.get('RABBITMQ_DEFAULT_PASS', None)
rabbitmq_domain = os.environ.get('RABBITMQ_DOMAIN_NAME', None)

if celery_default_user and celery_default_pass:
    CELERY_BROKER_URL = f'amqp://{celery_default_user}:{celery_default_pass}@{rabbitmq_domain}:5672//'
    CELERY_RESULT_BACKEND = f'rpc://{celery_default_user}:{celery_default_pass}@{rabbitmq_domain}:5672//'
else:
    CELERY_BROKER_URL = 'amqp://'
    CELERY_RESULT_BACKEND = 'rpc://'
```
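One subtle failure mode with building the broker URL via f-strings: credentials containing URL-reserved characters such as `@`, `/`, or `:` silently corrupt the AMQP URL. A minimal sketch of percent-encoding them with the standard library before interpolation (the credentials and hostname here are hypothetical examples):

```python
from urllib.parse import quote

# Example credentials (hypothetical) containing URL-reserved characters
user = 'celery'
password = 'p@ss/word:123'

# Percent-encode user and password so the URL parses unambiguously
broker_url = f'amqp://{quote(user, safe="")}:{quote(password, safe="")}@rabbitmq.local:5672//'
print(broker_url)  # amqp://celery:p%40ss%2Fword%3A123@rabbitmq.local:5672//
```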
I use service-discovery namespaces for my RabbitMQ instances so that they can be found by the Celery workers and the Beat process.
This all runs more or less smoothly. I had to increase `BROKER_CONNECTION_TIMEOUT` to 30 because the connection seems rather slow, even though the instances run in the same availability zone.
From time to time, errors crop up that I can't really narrow down, but I have the feeling they relate to connection losses between my producer/app, RabbitMQ, and Celery. As I am by no means a professional in these matters, I'm wondering whether anyone has built a similar setup and can point out where I should look more closely.