For a while now, I have occasionally noticed that my pod would not send out emails when it was supposed to. If I navigated to the Sidekiq monitor in the admin area and reprocessed all of the Dead jobs (or at least the email-related ones), the emails would come through. Upon further inspection, I found that it was due to a 'Too many open files' error. This post originally started as a post on my pod, so I'll quote that first:
I kept finding ‘Too many open files’ in my Sidekiq dead jobs and in the logs. They were mostly related to getaddrinfo (169 entries) or sendmail (223 entries). Also, emails would fail to send somewhat often from my pod. Took a look at this wiki entry (and the linked article) and hoping I will stop seeing those ‘too many open files’ entries in the logs.
Side note, I didn’t find any ‘…open files’ entries in sidekiq.log itself; I ended up running

```shell
grep "Too many open files" log/*.log
```

to find the entries.
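To get a per-cause breakdown like the getaddrinfo/sendmail counts above, something along these lines works:

```shell
# Count the "Too many open files" entries per underlying call.
grep -h "Too many open files" log/*.log | grep -c getaddrinfo
grep -h "Too many open files" log/*.log | grep -c sendmail
```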
Check for out-of-open-files errors
In the diaspora* app folder, do

```shell
grep "Too many open files" log/sidekiq.log
```

and check for results relating to the system user running the diaspora server running out of open files. If you get any errors, it is possible Sidekiq can be brought down by the same problem. Follow, for example, this guide to increase the limit of open files available to the user running diaspora*.
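The guide itself isn't reproduced here, but on a PAM-based system like CentOS 7 the per-user limit typically goes in `/etc/security/limits.conf`. The user name and values below are examples, not taken from the guide:

```
# /etc/security/limits.conf
poduser  soft  nofile  524288
poduser  hard  nofile  524288
```

The new limits apply to fresh login sessions for that user, which is why the `su - poduser` check below picks them up.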
There are a handful of comments on the post as well; read them as you wish. One thing I do want to point out, though: Jason Robinson mentioned that my pod is only the second one he knows of that has had this problem, so it doesn't appear that this is a widespread issue at the moment.
That fix went in, and a little later I went to bed. The next day after work, I dug into the production logs again and found some 'Too many open files' entries roughly 18 hours after putting in the fix. A little checking showed that the diaspora user had the limits I set, but the actual diaspora process was being capped much lower. Grr.
- OS: CentOS 7
- Init: systemd (based on this service file)
- Backend: MySQL (hosted on a different host)
- Branch: develop
After making the above changes, I could see that the hard and soft limits for the diaspora user were what I had set.
```
# su - poduser -c 'ulimit -aHS' -s '/bin/bash'
...
open files                      (-n) 524288
...
```
However, looking at the actual parent PID for the pod, I found that the limit was quite a bit lower than what I had set. Found the PID with:
```
# systemctl status diaspora.service
...
 Main PID: 30264 (bash)
...
```

```
# cat /proc/30264/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max open files            1024                 4096                 files
...
```
The systemd service file needed a little edit within the `[Service]` section.
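I haven't reproduced the whole unit file, but the relevant directive is systemd's `LimitNOFILE`, which sets the open-file limit for the service's processes. A sketch of the edit, using the same value I gave the user:

```
[Service]
LimitNOFILE=524288
```

A `systemctl daemon-reload` is needed before the restart for systemd to pick up the changed unit file.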
After restarting the pod and running the `cat /proc/XXX/limits` command again:
```
# cat /proc/XXX/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max open files            524288               524288               files
...
```
Bam! Now things are looking better. I will keep adjusting the various limits as time goes on. Hopefully this fixes the issue for good, but at the time of this writing it's too early to tell. I'll update this post if it doesn't. If it does help, w00t!
Now it's time to crack open a cold one. Cheers!
EDIT: I'm posting this here mostly for my own benefit. Here's a bash script that will email me the current open file count for the diaspora user: link
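The script itself is linked above rather than inlined; a minimal sketch of the idea (the user name and mail address are placeholders, and the counting walks `/proc` directly rather than whatever the real script does):

```shell
#!/bin/bash
# Hypothetical sketch: total up the file descriptors held by one user's
# processes. The real script (linked above) mails this number out.
count_open_files() {
    local total=0 pid uid
    uid=$(id -u "$1" 2>/dev/null) || { echo 0; return; }
    for pid in /proc/[0-9]*; do
        # Only look at processes owned by the target user.
        if [ "$(stat -c %u "$pid" 2>/dev/null)" = "$uid" ]; then
            total=$(( total + $(ls "$pid/fd" 2>/dev/null | wc -l) ))
        fi
    done
    echo "$total"
}

count_open_files "${1:-poduser}"
# To mail the result (assumes a working "mail" command):
#   count_open_files poduser | mail -s "pod open-file count" you@example.com
```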
EDIT 2015.08.02: After watching this open file issue for a bit, I found that the bulk of the "open files" were sidekiq connections hanging out in CLOSE_WAIT status. To get around this, I wrote a little bash script to restart sidekiq and a cron entry to run it daily. With the current state of diaspora*, I am only using the stop function of the bash script and letting the main D* process restart sidekiq on its own.
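That script isn't reproduced here either, but a minimal sketch of the stop-only approach might look like this (the user name, script path, and cron time are assumptions, not my actual setup):

```shell
#!/bin/bash
# Hypothetical sketch of the sidekiq restart script; only the "stop"
# branch is used, since the main diaspora process respawns sidekiq.
sidekiq_ctl() {
    case "$1" in
        stop)
            # Kill the Sidekiq workers for the pod user.
            pkill -u poduser -f sidekiq
            ;;
        *)
            echo "usage: sidekiq_ctl stop" >&2
            return 1
            ;;
    esac
}

if [ $# -gt 0 ]; then
    sidekiq_ctl "$1"
fi

# Example daily cron entry (04:00):
#   0 4 * * * /usr/local/bin/sidekiq_restart.sh stop
# To gauge the CLOSE_WAIT buildup beforehand:
#   ss -tan state close-wait | wc -l
```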