Thoughts on observability
Everything is complicated, even those things that seem flat in their bleakness.
Debugging a microservices application based on scarce information is one of those things I wouldn't wish on anyone. But that is how it is on my current project, so management started to put some measures in motion.
I researched the topic a bit at work and a bit on my own, and I have something to share: OpenTelemetry is the future. But it is still a work in progress.
In this post I will tell you everything I learned.
Observability intro
When it comes to the first (and most important) pillar of observability, logs, the first little revolution came with the invention of the Mapped Diagnostic Context (MDC). Neil Harrison described this method in "Patterns for Logging Diagnostic Messages", published in Pattern Languages of Program Design 3, edited by R. Martin, D. Riehle, and F. Buschmann (Addison-Wesley, 1997).
The beauty of the idea is its simplicity. You put a collection of key-value pairs in thread-local storage and implicitly append them to the logs each time a log message is created. As you know, IDs, names and other values can "span" multiple nested method calls, so this really makes life a little easier. Because it is 2025, the mainstream logging frameworks support it, but that's not all.
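To make the idea concrete, here is a minimal sketch of an MDC in plain Java. MiniMdc and MdcDemo are names I made up for illustration; they are not part of any logging framework. In a real project you would use the MDC that ships with your logging framework instead of rolling your own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, minimal MDC: a per-thread map that is prepended to every log line.
final class MiniMdc {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }

    static void remove(String key) { CONTEXT.get().remove(key); }

    // Formats a message with the current thread's context in front of it.
    static String format(String message) {
        return CONTEXT.get() + " " + message;
    }
}

final class MdcDemo {
    static String handleOrder(String orderId) {
        MiniMdc.put("orderId", orderId);
        try {
            return validate(); // the nested call never receives orderId as a parameter
        } finally {
            MiniMdc.remove("orderId"); // always clean up, threads get reused
        }
    }

    static String validate() {
        return MiniMdc.format("validating order");
    }

    public static void main(String[] args) {
        System.out.println(handleOrder("42")); // prints "{orderId=42} validating order"
    }
}
```

The important detail is the cleanup in the finally block: threads in a pool are reused, so a forgotten key would leak into log lines of an unrelated request.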
Taking it one step further, you can propagate these key-value pairs across threads and (eventually) service boundaries. Welcome to distributed tracing. Spring Boot covers all of this semi-automatically; all you have to do is configure the proper tools (Zipkin/Brave or OTEL).
And finally the gold standard for metrics: Prometheus periodically scrapes the metrics your Spring app exports on a given endpoint and acts as a store for Grafana to display.
These are the three components of observability: metrics, logs and traces.
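The "across threads" part is less magical than it sounds. Below is a rough sketch, in plain Java, of what a tracing library has to do: snapshot the context on the calling thread and restore it on the worker. ContextPropagation and its methods are hypothetical names for this post, not Spring or Brave APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper showing how a thread-local context can follow a task to another thread.
final class ContextPropagation {
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    // Captures the caller's context now and installs it around the task later.
    static <T> Callable<T> wrap(Callable<T> task) {
        Map<String, String> captured = new HashMap<>(CONTEXT.get()); // runs on the caller thread
        return () -> {
            Map<String, String> previous = CONTEXT.get();            // runs on the worker thread
            CONTEXT.set(captured);
            try {
                return task.call();
            } finally {
                CONTEXT.set(previous); // leave the worker thread as we found it
            }
        };
    }

    // Reads traceId on a worker thread, with or without propagation.
    static String runOnWorker(boolean propagate) {
        CONTEXT.get().put("traceId", "abc123");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = () -> CONTEXT.get().getOrDefault("traceId", "<missing>");
            return pool.submit(propagate ? wrap(task) : task).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnWorker(false)); // prints "<missing>"
        System.out.println(runOnWorker(true));  // prints "abc123"
    }
}
```

Without the wrapper the worker thread sees an empty context, which is exactly the "my trace ID disappeared in the async call" problem that the tracing libraries solve for you.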
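What Prometheus scrapes is just plain text. As a sketch, assuming only counters and no labels, this is roughly the shape of the payload an application serves on its metrics endpoint. MiniMetrics is a made-up class for illustration; in a Spring app a metrics library generates this output for you.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counter registry that renders itself in the Prometheus text format.
final class MiniMetrics {
    private final Map<String, LongAdder> counters = new TreeMap<>();

    synchronized void increment(String name) {
        counters.computeIfAbsent(name, key -> new LongAdder()).increment();
    }

    // Builds the body a scrape request would receive: one TYPE line and one sample per counter.
    synchronized String scrape() {
        StringBuilder out = new StringBuilder();
        counters.forEach((name, value) -> out
                .append("# TYPE ").append(name).append(" counter\n")
                .append(name).append(' ').append(value.sum()).append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        MiniMetrics metrics = new MiniMetrics();
        metrics.increment("orders_created_total");
        metrics.increment("orders_created_total");
        System.out.print(metrics.scrape());
        // prints:
        // # TYPE orders_created_total counter
        // orders_created_total 2
    }
}
```

The application only keeps the current value; the history lives in Prometheus, which is why it works as a store for Grafana.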
What the future will bring
Not so long ago OpenTelemetry emerged as a standard for application instrumentation that covers all three observability components. The state of providers is quite good - 40+ languages are covered more or less, exporting to various sinks like the ELK stack, Grafana and Zipkin/Jaeger.
But to be honest, I didn't find any easy-to-use, out-of-the-box tool to visualize these goods. Yes, there is Grafana with Loki (logs) and Tempo (traces), but the documentation has holes and you can't just set everything up in a day. Some open source dashboards exist to my knowledge, but they are a bit clunky to use and limited.
As usual, complex stuff requires time, and time is money. So I would look for innovation on the commercial side of the topic. I even have a favourite: Lightstep, now ServiceNow Cloud Observability, is one such solution that can do it well. They market it as being able to
- move freely between logs, traces and metrics
- use AI to diagnose problems (interesting)
But I'm not an architect and I'm not planning to put my own money on the table, so this has to stay in the land of dreams for now.
How I did the POC
At work I tried to set up Spring to send traces to Zipkin as a proof of concept, and it cost me a bit of effort. Because there are 2 bridges (adapters) and 2 exporters, there is a catch: mix them up and it will break. Oh, how I wish it would just break with an explicit error! Instead it silently sends nothing, and go figure why.
Luckily I had started with OTEL, so when I switched I noticed the mixed jars, and after excluding OTEL from the parent pom I finally got traces into Zipkin. The next step was to actually propagate the trace ID across service boundaries. I discovered a cool feature: baggage. It's a collection of key-value pairs that propagates together with the traceId/spanId. If you set a config property (remote-fields or something like that), it gets sent to the other service for free.
As long as you use the Spring stack to send requests, the MDC gets exported as a SimpleTextMap sent in a header.
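To show what "sent in a header" can look like on the wire, here is a simplified sketch in plain Java. It assumes the W3C header names traceparent and baggage; my POC may well have used a different propagation format, so treat the header names as an assumption. BaggageHeaders is a hypothetical class, and the parsing ignores optional parts of the format such as per-entry properties.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of carrying trace IDs and baggage in HTTP headers (W3C-style, simplified).
final class BaggageHeaders {

    // Sender side: turn the trace context and baggage into outgoing headers.
    static Map<String, String> inject(String traceId, String spanId, Map<String, String> baggage) {
        Map<String, String> headers = new LinkedHashMap<>();
        // version - trace id - parent span id - flags ("01" means sampled)
        headers.put("traceparent", "00-" + traceId + "-" + spanId + "-01");
        headers.put("baggage", baggage.entrySet().stream()
                .map(entry -> entry.getKey() + "=" + encode(entry.getValue()))
                .collect(Collectors.joining(",")));
        return headers;
    }

    // Receiver side: rebuild the key-value pairs, e.g. to put them back into the MDC.
    static Map<String, String> extractBaggage(Map<String, String> headers) {
        Map<String, String> baggage = new LinkedHashMap<>();
        for (String member : headers.getOrDefault("baggage", "").split(",")) {
            int eq = member.indexOf('=');
            if (eq > 0) {
                baggage.put(member.substring(0, eq).trim(),
                        decode(member.substring(eq + 1).trim()));
            }
        }
        return baggage;
    }

    // Simplification: URL encoding with spaces forced to %20 stands in for percent-encoding.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    private static String decode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> baggage = new LinkedHashMap<>();
        baggage.put("tenant", "acme corp");
        Map<String, String> headers =
                inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", baggage);
        System.out.println(headers.get("baggage"));  // prints "tenant=acme%20corp"
        System.out.println(extractBaggage(headers)); // prints "{tenant=acme corp}"
    }
}
```

The receiving service does the reverse trip and puts the values into its own context, which is why the same tenant or user ID shows up in the logs of both services.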
Conclusion
The future is the integration of all three pillars of observability.
In time there will eventually be an open source solution that lets you switch between logs, traces and metrics from a single place, much to everyone's joy.
But for now waiting is all we can do.