Distributed observability in Ray
As distributed computing systems grow, what begins as a small set of comprehensible operations balloons into an inscrutable, interconnected web of transformations, business logic, infrastructure automation, networking, and other components that hide information about the system from its engineers and operators. This information, generated by disparate and decoupled components, often describes implicit relations between those components and the data they produce. The health of the system depends on how accessible and legible that data is, and on the ease with which the system's operators can reach it.
To implement effective observability in our Ray deployment, we selected a suite of tools that address the different facets of observability—metrics, tracing, log aggregation, and alerting.
1. Metrics with Prometheus + Grafana
- Prometheus: We chose Prometheus for its robust time-series database and scraping capabilities. Prometheus collects metrics from Ray nodes, including CPU, memory, and task metrics, using a pull-based approach: it periodically scrapes a metrics endpoint on each node rather than waiting for nodes to push data.
- Grafana: For visualizing these metrics, we paired Prometheus with Grafana, a popular open-source analytics platform.
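To make the pull-based model concrete, here is a minimal sketch of the kind of text a scrape target returns when Prometheus requests its metrics endpoint. It renders gauge samples in the Prometheus text exposition format using only the standard library. The function names and the `demo_node_cpu_percent` metric are hypothetical placeholders for illustration, not Ray's actual exporter or metric names.

```python
from typing import Dict, List, Tuple

# One sample: (metric name, labels, value). The metric names passed in are
# illustrative placeholders, not a list of what Ray actually exports.
Sample = Tuple[str, Dict[str, str], float]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_exposition(samples: List[Sample]) -> str:
    """Render gauge samples as Prometheus text exposition lines.

    Assumes samples that share a metric name are adjacent in the input,
    since the format expects each metric's samples to be grouped together.
    """
    lines: List[str] = []
    typed = set()
    for name, labels, value in samples:
        if name not in typed:
            # Declare the type once, before the first sample of each metric.
            lines.append(f"# TYPE {name} gauge")
            typed.add(name)
        label_text = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_exposition([
        ("demo_node_cpu_percent", {"node": "head"}, 42.5),
        ("demo_node_cpu_percent", {"node": "worker-1"}, 17.0),
    ]))
```

Prometheus would fetch text like this on every scrape interval and store each line as a point in a time series keyed by the metric name and labels, which is what Grafana then queries.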
2. Tracing with Jaeger/OTel
- Jaeger: Jaeger was our tracing tool of choice, providing end-to-end distributed tracing. It collects, stores, and visualizes traces that span Ray components.
- OpenTelemetry (OTel): OTel serves as the instrumentation layer, generating spans inside Ray components and propagating trace context between them so that Jaeger can assemble them into a single trace.
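End-to-end tracing depends on every component passing along a shared trace identifier. The sketch below illustrates that idea with the W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry. It is a standard-library illustration of the header layout, not the OpenTelemetry SDK itself, and the `SpanContext` type and helper names are hypothetical.

```python
import secrets
from typing import NamedTuple


class SpanContext(NamedTuple):
    trace_id: str  # 32 lowercase hex chars, shared by every span in a trace
    span_id: str   # 16 lowercase hex chars, unique to one span
    sampled: bool  # whether this trace should be recorded


def new_root_context() -> SpanContext:
    """Start a new trace with random identifiers."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def to_traceparent(ctx: SpanContext) -> str:
    """Serialize as version-traceid-spanid-flags (version 00)."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_traceparent(header: str) -> SpanContext:
    """Parse a version-00 traceparent header received from a caller."""
    version, trace_id, span_id, flags = header.split("-")
    if version != "00" or len(trace_id) != 32 or len(span_id) != 16 or len(flags) != 2:
        raise ValueError(f"unsupported traceparent: {header!r}")
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_context(parent: SpanContext) -> SpanContext:
    """Create a child span: same trace_id, fresh span_id."""
    return parent._replace(span_id=secrets.token_hex(8))


if __name__ == "__main__":
    root = new_root_context()
    header = to_traceparent(root)           # sent along with the remote call
    received = from_traceparent(header)     # parsed on the receiving side
    print(header)
    print(to_traceparent(child_context(received)))
```

Because the child keeps the parent's `trace_id`, a backend such as Jaeger can group spans emitted by different processes into one trace.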
3. Log aggregation with Promtail + Loki
- Promtail: Promtail serves as an agent to collect logs from Ray components and forward them to Loki.
- Loki: Loki is a log aggregation system optimized for massive log volumes. It indexes only the labels attached to each log stream rather than the full log text, which gives it a "grep-like" query experience, and it integrates natively with Grafana.
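As a rough illustration of what an agent like Promtail sends, the sketch below builds the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`): a list of streams, each with a label set and `[timestamp, line]` pairs where the timestamp is a string of Unix nanoseconds. The function name, the label names (`job`, `component`), and the one-nanosecond spacing between lines are assumptions made for this example, not Promtail's actual behavior or our deployment's labels.

```python
import json
import time
from typing import Dict, List, Optional


def build_push_payload(
    labels: Dict[str, str],
    lines: List[str],
    timestamp_ns: Optional[int] = None,
) -> str:
    """Build a JSON body for Loki's push API from a batch of log lines.

    All lines go into one stream identified by `labels`. Each line gets a
    timestamp one nanosecond after the previous one so ordering is kept
    (an assumption of this sketch; a real agent uses observed timestamps).
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    values = [
        [str(timestamp_ns + offset), line]  # Loki expects the timestamp as a string
        for offset, line in enumerate(lines)
    ]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


if __name__ == "__main__":
    body = build_push_payload(
        {"job": "ray", "component": "raylet"},  # hypothetical label set
        ["raylet started", "worker registered"],
    )
    print(body)
```

Since Loki indexes only the stream labels, keeping the label set small and low-cardinality matters more than the content of the log lines themselves.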
4. Alerting with Robusta
- Robusta: We chose Robusta for alerting, given its ability to manage complex alert rules, automatically enrich alerts with additional context, and trigger predefined remediation actions. It integrates well with Prometheus and Grafana.