It was a typical Saturday as I spent the day doing housework and running errands. Everything was going fine until I got home that afternoon and didn't receive my usual notifications from my digital assistant, Zoee, letting me know that the garage door had been opened and closed.
The thing is, I really didn't even notice the lack of notifications until much later that evening when my wife came home from a night out with friends.
I'll talk about Zoee more in a different post (she essentially replaced T.I.T.U.S.), but in short, I have a handful of microservices running on an Ubuntu box in my office that assist me in various ways, such as keeping an eye on the state of my garage doors.
When a door is opened or closed, I get an SMS message delivered to my phone letting me know the state of the door. I also set up a speech synthesis module utilizing Amazon's Polly service that runs on a Raspberry Pi wired into my home stereo, so I get verbal notifications of the same events, as well.
It was about 10:30 when my wife pulled in, and after she walked in the door, I realized I never heard Zoee's voice come over the stereo. This happens for various reasons from time to time, and when it does, I usually check to see if I got an accompanying text message.
If I do receive a text message, then I know the problem is more isolated and lies within the speech synthesis module.
If I don't receive a text message, though, which was the case on Saturday, I know there's more of a systemic issue I need to dig into, as it's likely none of my services are working at that point.
Typically, when this happens, it's because of a networking issue that causes all of my services and devices to lose their connection to my local server. Simply restarting a few things gets everything back up and running in just a few short minutes.
This time, however, things were different.
When I logged into my RabbitMQ administration tool, I saw that all of my connections had issues, so no traffic was flowing through the broker at all:
Wondering why all of my connections were either blocked or blocking, I switched over to the dashboard view and saw a nice, big, red square indicating that I had zero disk space left:
"How in the hell did this happen?" I asked myself.
"I have a 300GB disk. What is filling it up?"
Since I'm not as well-versed in Linux as I am in Windows systems, I took to Google to find a good tool for analyzing disk usage.
I found a number of people using a tool called NCDU, so I decided to give that a shot.
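For anyone who wants to follow along, getting ncdu going on Ubuntu is just a couple of commands (the `-x` flag keeps the scan on a single filesystem so other mounts don't muddy the numbers):

```shell
# ncdu (NCurses Disk Usage) gives an interactive, sortable breakdown
# of what's eating your disk.
sudo apt-get install -y ncdu

# Scan from the root of the filesystem; -x tells ncdu not to cross
# filesystem boundaries into other mounted volumes.
sudo ncdu -x /
```

From there you can arrow-key your way down into whichever directories are hogging the most space.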
After poking around for a while at different volumes and directories, I narrowed it down to my Docker containers:
As you can see in the image above, I had one container in particular consuming a whopping 191.8 gigs, and the next largest container was taking up an additional 60. Those two containers combined were eating up about 84% of my disk!
"What the fuck?!"
Digging into each of those directories, I found that what was actually consuming the space in each one was a JSON log file:
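If you'd rather skip the interactive poking around, you can go straight to the log files themselves, since Docker's default `json-file` driver writes each container's log under `/var/lib/docker/containers`:

```shell
# Each container logs to:
#   /var/lib/docker/containers/<container-id>/<container-id>-json.log
# Sort the log files by size to spot the offenders at a glance:
sudo du -h /var/lib/docker/containers/*/*-json.log | sort -rh | head -n 5
```

The top of that list will point you right at the hungriest containers.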
Great. I figured out where the problem was, but I still had two questions: "Why was Docker producing these massive log files?" and "How could I limit their sizes?"
A quick Google search of "Docker JSON log" yielded a number of results, the first of which had the answers to both of my questions.
The very first paragraph answered my first question (why can't all Googling go this smoothly?!):
"By default, Docker captures the standard output (and standard error) of all your containers, and writes them in files using the JSON format. The JSON format annotates each line with its origin (stdout or stderr) and its timestamp. Each log file contains information about only one container."
That's great that Docker turns this on by default, because these logs can be really helpful in tracking down issues.
What I have an issue with, however, is the fact that Docker does this without any sort of limit on the log files themselves.
Now, you could argue that I should have done more research and gone through more training before utilizing Docker, because if I had, I probably would have known this.
But, that's not how I operate, and this was just for a fun home automation project, so fuck that shit.
Additionally, as a developer myself, I wouldn't (intentionally) allow a feature such as this to go unchecked, knowing damn well that it could eventually consume the entire disk, which is exactly what happened to me just six months after building this container.
And the kicker is, this is just for a home automation project on my own, personal server! Can you imagine the impact this would have on an enterprise system with tons of users and many more containers?
Limiting the Logs
If you look a little further down in that document, you'll see the answer to my second question on how to limit the size of these files.
It turns out that you can do this in a number of ways, but I took the global approach by applying the limit in the daemon.json config file in /etc/docker. I just used the config settings from their examples, so now each of my containers is able to create up to three rolling log files at 10 MB apiece. That's still quite a bit of data for helping me track down any issues, and I can always allow bigger files with more log data later if need be.
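For reference, here's what those three-file, 10 MB limits look like in /etc/docker/daemon.json, matching the example in the Docker docs:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```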
Of course, your mileage may vary, so you'll want to evaluate your specific needs and set these limits accordingly.
The downside to making these config changes was that these settings wouldn't take effect on the containers until I recreated each one of them. Just restarting the containers doesn't work. You have to actually stop, remove, and create/run the containers again for the changes to take effect.
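The recreate dance looks roughly like this for each container ("garage-monitor" is a hypothetical container/image name standing in for whatever you're actually running):

```shell
# First, restart the Docker daemon so it picks up the daemon.json change:
sudo systemctl restart docker

# The new log-opts only apply to containers created after the change,
# so each one has to be stopped, removed, and run again:
docker stop garage-monitor
docker rm garage-monitor
docker run -d --name garage-monitor garage-monitor:latest

# Confirm the new limits actually took effect:
docker inspect --format '{{ .HostConfig.LogConfig }}' garage-monitor
```

If your containers are defined in a Compose file instead, this whole ritual collapses down to bringing the stack down and back up.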
Luckily for me, this didn't take very long, but I did have eighteen containers that I had to recreate - each with different run parameters - so it was still kind of annoying.
On the plus side, once I removed the containers, my disk cleared right up, and RabbitMQ was back up and running again, allowing my garage door messages (among others) to flow through Zoee once again.
.NET Core Shares the Blame
I will cut Docker a little slack, because .NET Core is to blame as well, since most of my containers run .NET Core applications, and that's what's doing the majority of the logging.
A handful of the applications, such as the one that chewed up ~192 gigs of disk space, utilize Entity Framework Core to work with a local MariaDB database, and from what I can tell, EF Core is one chatty motherfucker!
By running the docker logs command and giving it the ID for one of my problematic containers, I could see exactly what was being logged to the massive JSON log files:
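If you want to peek at a container's output yourself, it's a one-liner (the ID here is hypothetical; grab a real one from `docker ps`):

```shell
# Tail the last 50 lines of a container's stdout/stderr -- the same
# stream that ends up in its JSON log file on disk:
docker logs --tail 50 1a2b3c4d5e6f
```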
I will say, though, that I would rather have too much information than not enough any day of the week, but I may need to look into toning the logging down a little bit nonetheless.
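If I do end up quieting EF Core down, the usual knob in a .NET Core app is the Logging section of appsettings.json. Bumping the EF Core command category up to Warning silences the per-query chatter while keeping everything else (the category name below is the standard Microsoft one for database commands):

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.EntityFrameworkCore.Database.Command": "Warning"
    }
  }
}
```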
At this point, I'm six days out from my Docker container meltdown, and my server is humming along just fine (as far as I know, anyway).
Hopefully, this sheds some light on one of the "gotchas" of using Docker containers for your applications, and you can now avoid the "container constipation" that I experienced.
(Header image credit: Julien Delaunay on Unsplash)