January 12, 2021 | 22:20

Cert Monitoring

Everyone knows the warning that appears when visiting a website with an expired certificate. At least once a month I stumble across one, or I receive tickets asking what to do. “Nothing” is my reply in most cases. “The mistake is on the other side.” At this point, my very special thanks to the owners and administrators of such sites for the extra work they cause.

The obvious way to avoid such embarrassment: software or a service that checks periodically and sends notifications. It sounds simple, but unfortunately it doesn’t always work.

Here is an overview of prominent failures to renew certificates properly in recent years:

  • 2020/06 Sectigo Root-Cert [1]
  • 2020/02 Microsoft Teams [2]
  • 2019/10 Apple App-Store [3]
  • 2019/08 Twitter [4]
  • 2019/05 Mozilla Add-Ons [5]
  • 2018/04 Microsoft Azure [6]

And these are just examples from big tech companies. The many smaller websites stay below the radar. So why does this go so badly wrong?

Non-renewed certificates are no small matter and don’t just affect websites. Many apps on mobile devices stop working when their JSON endpoints [7] are no longer reachable. And frankly, most users click warnings away mechanically, without knowing what they have just accepted. The unpleasant surprise may come a little later, when one of the many dismissed warnings turns out to have been malware asking for elevated rights. Let’s look at some figures illustrating how a company is affected by non-renewed certs.

The TUI Group, which operates worldwide, generates up to 20,000 video conferences and 250,000 chat messages per workday [8]. According to the data, the Microsoft Teams outage on February 4th, 2020 between 3pm and 6pm hampered 7,500 conferences and, at its mildest, disrupted schedules. At its worst, important decisions could not be made and deadlines were missed. Unfortunately, no figure exists for the number of angry business partners, suppliers or customers.

MS Teams was launched in 2017 [9], and its certificate is renewed annually. Two successful renewals have so far been followed by one missed renewal. Even without a degree in statistics, that’s a miserable record, because a certificate renewal doesn’t come out of the blue.

If the success rate of a measure is close to what pure chance would deliver [10], you really should abandon it. Rolling dice or hiring a bunch of monkeys would work just as reliably as that measure.

Why do renewals fail?

During my research I was unfortunately unable to find a single source that says more about the contributing factors. So, without claiming completeness or accuracy and based on my personal experience, I have drafted this list:

a) Leadership failure: Nobody feels responsible.
b) Organisational failure: Lack of processes and no overview of the certs within an organisation.
c) Sloppiness because monitoring is already in place.
d) Negligence due to lack of monitoring. Blind trust in auto-renewal processes.
e) Deliberate action, sabotage.

The striking common characteristic: All of them are non-technical factors.

This sounds rather discouraging, but it is not entirely hopeless. As a secret weapon I have something I have been using since the 90s: source code version control.

Even the best software in the world cannot turn bad leadership or mismanagement around.

Git + Gitea + Monit

Last year I featured Prometheus [11] and Grafana [12] on this blog as tools within my monitoring stack. Today Monit [13] follows, which I use for regular checks of websites, APIs and certificates.

Why Monit exactly? First, queries and rules can be spread neatly across different configuration files. Second, it is very powerful. And third, its rules are very simple to write thanks to a BASIC-like [14] syntax. For instance, this is the rule for monitoring the certificate of this blog:

# Alert if the HTTPS check fails or the certificate expires within 5 days
check host blog.jakobs.systems with address blog.jakobs.systems
  if failed
    port 443
    protocol https
    certificate valid > 5 days
  then alert
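
To make explicit what such a rule checks, here is a minimal Python sketch using only the standard library. It is not how Monit is implemented. The function names and the 10-second timeout are assumptions chosen for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def certificate_ok(host: str, min_days: float = 5.0) -> bool:
    """True if the host is reachable and its certificate is valid for more than min_days."""
    try:
        not_after = fetch_not_after(host)
    except OSError:
        # Covers connection errors, timeouts and failed TLS verification
        # (an already expired certificate fails the handshake).
        return False
    return days_until_expiry(not_after, datetime.now(timezone.utc)) > min_days
```

Calling `certificate_ok("blog.jakobs.systems")` would then correspond roughly to the rule above, with a `False` result playing the role of the alert.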

The magic lies in the interaction of Monit with git [15] and Gitea [16]. All Monit instances obtain their configuration automatically from a git repository [17]. Management and scaling are done via the web interface of Gitea and do not require any programming knowledge. Quite convenient: the documentation lives in the repository as a byproduct, because I write it there in Markdown [18].
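
How the instances fetch their configuration is not spelled out here, so the following is only one possible sketch in Python. It assumes that the repository is checked out in a directory that the Monit control file includes, that `git` and `monit` are on the PATH, and that the script is started periodically (for example by cron). The function names and the return strings are made up for this example.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def sync_config(repo_dir: str, runner: Runner = run) -> str:
    """Pull the latest rules, validate them, then ask Monit to reload."""
    # Fast-forward only, so local edits on the host never get merged silently.
    if runner(["git", "-C", repo_dir, "pull", "--ff-only"]) != 0:
        return "pull-failed"
    # "monit -t" runs a syntax check of the control file.
    if runner(["monit", "-t"]) != 0:
        return "syntax-error"
    # "monit reload" makes the running daemon re-read its configuration.
    if runner(["monit", "reload"]) != 0:
        return "reload-failed"
    return "reloaded"
```

A call such as `sync_config("/etc/monit/conf.d")` (a hypothetical path) skips the reload when the syntax check fails, so the running daemon is not asked to load a broken rule set.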

Screenshot Gitea

All the information on certificates in an organisation now has a place. Changes do not remain in the hands of a few: every web developer can, for instance, add new hosts or change existing ones. It is hardly possible to break anything. On the one hand, every change can be rolled back. On the other hand, serious mistakes are caught right away in the mutual peer review of pull requests [19]. Should a defective configuration still find its way to a Monit instance, Monit itself notifies me immediately and I can still intervene.

The combination of Monit, Git and Gitea yields a living, scalable system with a continuous improvement process.

Resilience

I cannot continue without a few words on resilience [20]. Unfortunately, I keep encountering networks where the entire IT infrastructure is monitored from a single internal host. You can do that, but you create unnecessary, self-inflicted problems. Combined with the typical “single points of failure” [21] such as mail servers, firewalls or switches, this calls for immediate action.

Power grids are successfully operated with the (n-1) rule [22]. This means that for every element in a system there must be at least one redundant spare element. Overhead power lines, for example, have at least two control systems, each operated at only 50% load. If one control system fails, the remaining one can quickly take over 100% of the load.
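
As a rough illustration (not taken from any grid standard), the rule can be written as a small Python check. The function name and the numbers are made up for this example.

```python
from typing import Sequence


def survives_single_failure(capacities: Sequence[float], load: float) -> bool:
    """Return True if the load is still covered after any one element fails.

    The worst case is losing the element with the largest capacity,
    so checking that single case is enough.
    """
    if len(capacities) < 2:
        return False  # a single element has no spare
    return sum(capacities) - max(capacities) >= load


# Two systems rated for 100 units each, normally carrying 50 units each.
print(survives_single_failure([100, 100], load=100))  # True
# One system alone has no redundancy, whatever the load.
print(survives_single_failure([100], load=50))        # False
```

The same check applies to monitoring: if every instance can watch all hosts on its own, any single instance may fail without leaving a blind spot.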

Screenshot of my 3 Monit instances

This rule also improves resilience in IT. The screenshot shows my three monitoring instances, distributed across Germany (data centre, office, home office) on different ISPs (Anexia backbone, Vodafone, Telekom), which keep an eye not only on my own hosts and certificates but also on those of my customers.

There are reasons why not all monitoring is the same.

I would be happy to help by providing my know-how.
Sustainable, transparent, fair and with open source.

Stay healthy,
Tomas Jakobs


  1. https://www.heise.de/news/AddTrust-Probleme-durch-abgelaufenes-Root-Zertifkat-4771717.html
  2. https://www.heise.de/newsticker/meldung/Microsoft-Teams-Ausfall-wegen-Zertifikatsablauf-4652527.html
  3. https://www.heise.de/mac-and-i/meldung/Abgelaufene-Sicherheitszertifikate-Wie-man-jetzt-neue-macOS-Installer-findet-4569853.html
  4. https://www.pcworld.com/article/201980/article.html
  5. https://www.heise.de/newsticker/meldung/Zertifikat-abgelaufen-Firefox-deaktiviert-Add-ons-4413170.html
  6. https://www.borncity.com/blog/2018/04/18/zertifikatsprobleme-bei-microsoft-seiten/
  7. https://de.wikipedia.org/wiki/JavaScript_Object_Notation
  8. https://www.tuigroup.com/en-en/media/stories/2021/2021-01-07-ms-teams-and-corona
  9. https://en.wikipedia.org/wiki/Microsoft_Teams
  10. https://de.wikipedia.org/wiki/Normalverteilung
  11. https://blog.jakobs.systems/blog/20201025-monitoring-prometheus/
  12. https://grafana.com/
  13. https://www.mmonit.com/monit
  14. https://de.wikipedia.org/wiki/BASIC
  15. https://de.wikipedia.org/wiki/Git
  16. https://gitea.io/en-us/
  17. https://en.wikipedia.org/wiki/Repository_(version_control)
  18. https://de.wikipedia.org/wiki/Markdown
  19. https://de.wikipedia.org/wiki/Pull_Request
  20. https://de.wikipedia.org/wiki/Resilienz-Management
  21. https://de.wikipedia.org/wiki/Single_Point_of_Failure
  22. https://de.wikipedia.org/wiki/(n_%E2%80%93_1)-Regel

© 2021 Tomas Jakobs - Imprint and Legal Notice