Observability Rail
Platform monitoring, SLO tracking, and distributed tracing.
Overview
The Observability Rail provides endpoints for monitoring platform health, SLO tracking, distributed tracing, and metrics collection.
Base URL
/api/v1/observability
Endpoints
Get Health Status
GET /api/v1/observability/health
Get platform health status.
Response:
{
  "data": {
    "status": "HEALTHY",
    "timestamp": "2025-01-15T10:00:00Z",
    "components": {
      "api": { "status": "HEALTHY", "latency": 45 },
      "database": { "status": "HEALTHY", "latency": 12 },
      "cache": { "status": "HEALTHY", "latency": 2 },
      "queue": { "status": "HEALTHY", "latency": 5 }
    },
    "version": "2.0.0"
  }
}
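A client will often want only the components that need attention. The sketch below assumes the response has already been parsed into a dict shaped like the example above. `unhealthy_components` is a hypothetical helper name, and treating every status other than `HEALTHY` as unhealthy is an assumption, since the other status values are not listed here.

```python
from typing import Any


def unhealthy_components(health: dict[str, Any]) -> dict[str, str]:
    """Return {component name: status} for every component not reporting HEALTHY."""
    components = health["data"]["components"]
    return {
        name: info["status"]
        for name, info in components.items()
        if info["status"] != "HEALTHY"
    }


# Payload trimmed from the response above; the DEGRADED status is hypothetical.
sample = {
    "data": {
        "status": "HEALTHY",
        "components": {
            "api": {"status": "HEALTHY", "latency": 45},
            "cache": {"status": "DEGRADED", "latency": 120},
        },
    }
}
print(unhealthy_components(sample))  # {'cache': 'DEGRADED'}
```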
Get SLO Status
GET /api/v1/observability/slos
Get Service Level Objective status.
Response:
{
  "data": [
    {
      "sloId": "api_availability",
      "name": "API Availability",
      "target": 0.999,
      "current": 0.9995,
      "status": "MET",
      "period": "30d",
      "errorBudgetRemaining": 0.0005
    },
    {
      "sloId": "api_latency_p99",
      "name": "API Latency P99",
      "target": 500,
      "current": 245,
      "unit": "ms",
      "status": "MET",
      "period": "30d"
    }
  ]
}
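The availability entry can be turned into a "share of error budget left" figure. This is a sketch under an assumption: `errorBudgetRemaining` is read as an absolute ratio, so the total budget is `1 - target`. That reading matches the sample values (target 0.999, current 0.9995), but it is not confirmed by this page. `budget_remaining_fraction` is a hypothetical helper and applies only to ratio-style SLOs, not to the latency entry.

```python
def budget_remaining_fraction(slo: dict) -> float:
    """Fraction of the error budget still unspent for a ratio-style SLO.

    Assumes `target` is a ratio below 1.0 and `errorBudgetRemaining` is an
    absolute ratio, so the total budget is 1 - target.
    """
    total_budget = 1.0 - slo["target"]
    if total_budget <= 0:
        raise ValueError("target must be below 1.0")
    return slo["errorBudgetRemaining"] / total_budget


availability = {
    "sloId": "api_availability",
    "target": 0.999,
    "current": 0.9995,
    "errorBudgetRemaining": 0.0005,
}
print(f"{budget_remaining_fraction(availability):.0%}")  # 50%
```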
Get Metrics
GET /api/v1/observability/metrics
Get platform metrics.
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| `metric` | string | Metric name (see Available Metrics) |
| `period` | string | Time period, e.g. `24h` |
| `aggregation` | string | Aggregation function: `avg`, `sum`, `max`, `min`, or `p99` |
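The query parameters above can be assembled into a request path as follows. This is a minimal sketch using only the standard library; `metrics_url` is a hypothetical helper, and rejecting unknown aggregations on the client side is a design choice of this example rather than documented server behavior.

```python
from urllib.parse import urlencode

BASE_URL = "/api/v1/observability"
AGGREGATIONS = {"avg", "sum", "max", "min", "p99"}


def metrics_url(metric: str, period: str, aggregation: str) -> str:
    """Build the metrics path with an encoded query string."""
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation!r}")
    query = urlencode({"metric": metric, "period": period, "aggregation": aggregation})
    return f"{BASE_URL}/metrics?{query}"


print(metrics_url("api_requests_total", "24h", "sum"))
# /api/v1/observability/metrics?metric=api_requests_total&period=24h&aggregation=sum
```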
Response:
{
  "data": {
    "metric": "api_requests_total",
    "period": "24h",
    "aggregation": "sum",
    "values": [
      { "timestamp": "2025-01-14T10:00:00Z", "value": 150000 },
      { "timestamp": "2025-01-14T11:00:00Z", "value": 175000 }
    ],
    "total": 3500000
  }
}
Get Traces
GET /api/v1/observability/traces
Get distributed traces.
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| `service` | string | Filter by service |
| `operation` | string | Filter by operation |
| `minDuration` | number | Minimum duration (ms) |
| `status` | string | Filter by status: `OK` or `ERROR` |
| `from` | string | Start timestamp |
Response:
{
  "data": [
    {
      "traceId": "trace_abc123",
      "rootSpan": "POST /api/v1/contracts",
      "service": "rail-api",
      "duration": 245,
      "status": "OK",
      "spanCount": 12,
      "timestamp": "2025-01-15T10:00:00Z"
    }
  ]
}
Get Trace Details
GET /api/v1/observability/traces/:traceId
Get a detailed trace with all of its spans.
Response:
{
  "data": {
    "traceId": "trace_abc123",
    "spans": [
      {
        "spanId": "span_1",
        "parentSpanId": null,
        "operation": "POST /api/v1/contracts",
        "service": "rail-api",
        "duration": 245,
        "status": "OK",
        "tags": { "http.method": "POST", "http.status_code": 201 }
      },
      {
        "spanId": "span_2",
        "parentSpanId": "span_1",
        "operation": "db.insert",
        "service": "postgresql",
        "duration": 45,
        "status": "OK"
      }
    ]
  }
}
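Because each span carries a `parentSpanId`, the flat `spans` list can be rebuilt into a call tree. The sketch below assumes root spans have a null parent, as in the sample, and that `duration` is in milliseconds, which this page states only for the `minDuration` filter. `render_span_tree` is a hypothetical helper name.

```python
from collections import defaultdict


def render_span_tree(spans: list[dict]) -> list[str]:
    """Return one indented line per span, with children nested under parents."""
    children = defaultdict(list)
    for span in spans:
        children[span["parentSpanId"]].append(span)

    lines: list[str] = []

    def walk(parent_id, depth: int) -> None:
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(
                f"{indent}{span['operation']} ({span['service']}, {span['duration']} ms)"
            )
            walk(span["spanId"], depth + 1)

    walk(None, 0)  # root spans have parentSpanId == None (JSON null)
    return lines


spans = [
    {"spanId": "span_1", "parentSpanId": None, "operation": "POST /api/v1/contracts",
     "service": "rail-api", "duration": 245},
    {"spanId": "span_2", "parentSpanId": "span_1", "operation": "db.insert",
     "service": "postgresql", "duration": 45},
]
print("\n".join(render_span_tree(spans)))
```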
Get Alerts
GET /api/v1/observability/alerts
Get active alerts.
Response:
{
  "data": [
    {
      "alertId": "alert_xyz",
      "name": "High Error Rate",
      "severity": "WARNING",
      "status": "FIRING",
      "message": "Error rate > 1% for contracts rail",
      "startedAt": "2025-01-15T09:45:00Z",
      "labels": { "rail": "contracts", "environment": "production" }
    }
  ]
}
Acknowledge Alert
POST /api/v1/observability/alerts/:alertId/acknowledge
Acknowledge an alert.
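No request body or response is documented for this endpoint, so the sketch below only constructs the request and does not send it. The host `https://api.example.com` and the helper `acknowledge_request` are assumptions for illustration, and any authentication headers the platform requires are not covered here.

```python
from urllib.parse import quote
from urllib.request import Request

API_HOST = "https://api.example.com"  # hypothetical host, not from this page


def acknowledge_request(alert_id: str) -> Request:
    """Build (but do not send) the POST request that acknowledges an alert."""
    # Percent-encode the ID so unusual characters cannot alter the path.
    path = f"/api/v1/observability/alerts/{quote(alert_id, safe='')}/acknowledge"
    return Request(API_HOST + path, method="POST")


req = acknowledge_request("alert_xyz")
print(req.get_method(), req.full_url)
# POST https://api.example.com/api/v1/observability/alerts/alert_xyz/acknowledge
```

Passing the returned object to `urllib.request.urlopen` would send it.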
Get Error Rates
GET /api/v1/observability/errors
Get error rates by rail/endpoint.
Response:
{
  "data": {
    "period": "1h",
    "totalRequests": 150000,
    "totalErrors": 150,
    "errorRate": 0.001,
    "byRail": {
      "contracts": { "requests": 50000, "errors": 50, "rate": 0.001 },
      "kyc": { "requests": 30000, "errors": 30, "rate": 0.001 }
    },
    "topErrors": [
      { "code": "VALIDATION_ERROR", "count": 100 },
      { "code": "NOT_FOUND", "count": 35 }
    ]
  }
}
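The per-rail breakdown makes it straightforward to flag rails whose error rate crosses a threshold. In this sketch the rate is recomputed from `errors` and `requests` rather than read from `rate`. The helper name `rails_over_threshold` and the spiked sample values are hypothetical, and the 1% threshold simply mirrors the alert message shown earlier.

```python
def rails_over_threshold(errors: dict, threshold: float) -> list[str]:
    """Return the names of rails whose error rate exceeds `threshold`, sorted."""
    flagged = []
    for rail, stats in errors["data"]["byRail"].items():
        # Guard against division by zero for rails with no traffic.
        rate = stats["errors"] / stats["requests"] if stats["requests"] else 0.0
        if rate > threshold:
            flagged.append(rail)
    return sorted(flagged)


# Hypothetical payload: the contracts rail is spiking at 1.2%.
sample_errors = {
    "data": {
        "byRail": {
            "contracts": {"requests": 50000, "errors": 600},
            "kyc": {"requests": 30000, "errors": 30},
        }
    }
}
print(rails_over_threshold(sample_errors, 0.01))  # ['contracts']
```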
Available Metrics
| Metric | Description |
|---|---|
| `api_requests_total` | Total API requests |
| `api_request_duration_ms` | Request duration in milliseconds |
| `api_errors_total` | Total errors |
| `db_connections_active` | Active DB connections |
| `cache_hit_ratio` | Cache hit ratio |
| `queue_depth` | Message queue depth |
SLO Types
| Type | Description |
|---|---|
| Availability | Service availability |
| Latency | Response time |
| Error Rate | Error percentage |
| Throughput | Request volume |
Alert Severities
| Severity | Description |
|---|---|
| CRITICAL | Immediate action required |
| WARNING | Attention needed |
| INFO | Informational |
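The severity levels above imply an ordering that a client can use to triage the alert list. This is a minimal sketch: `SEVERITY_RANK` and `sort_by_severity` are hypothetical names, unknown severities are placed last by choice, and `alert_1` and `alert_2` are made-up entries. Ties are broken by `startedAt`, relying on the UTC timestamps in the samples sorting correctly as strings.

```python
SEVERITY_RANK = {"CRITICAL": 0, "WARNING": 1, "INFO": 2}


def sort_by_severity(alerts: list[dict]) -> list[dict]:
    """Most severe first; ties are broken by the earliest startedAt."""
    return sorted(
        alerts,
        key=lambda a: (
            SEVERITY_RANK.get(a["severity"], len(SEVERITY_RANK)),
            a["startedAt"],
        ),
    )


alerts = [
    {"alertId": "alert_1", "severity": "INFO", "startedAt": "2025-01-15T09:50:00Z"},
    {"alertId": "alert_xyz", "severity": "WARNING", "startedAt": "2025-01-15T09:45:00Z"},
    {"alertId": "alert_2", "severity": "CRITICAL", "startedAt": "2025-01-15T09:55:00Z"},
]
print([a["alertId"] for a in sort_by_severity(alerts)])
# ['alert_2', 'alert_xyz', 'alert_1']
```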
Events
| Event | Description |
|---|---|
| `observability.alert.fired` | Alert triggered |
| `observability.alert.resolved` | Alert resolved |
| `observability.slo.breach` | SLO breached |
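A consumer of these events typically routes each one to a handler by name. This page documents neither the delivery mechanism nor the payload shape, so the sketch below is illustrative only: the `name` and `sloId` payload keys are borrowed from the alert and SLO responses as an assumption, and `dispatch` and the handler functions are hypothetical.

```python
from typing import Callable, Optional


def on_alert_fired(payload: dict) -> str:
    return f"notify on-call: {payload.get('name', 'unnamed alert')}"


def on_alert_resolved(payload: dict) -> str:
    return f"clear notification: {payload.get('name', 'unnamed alert')}"


def on_slo_breach(payload: dict) -> str:
    return f"open incident for SLO {payload.get('sloId', 'unknown')}"


# Map each documented event name to its handler.
HANDLERS: dict[str, Callable[[dict], str]] = {
    "observability.alert.fired": on_alert_fired,
    "observability.alert.resolved": on_alert_resolved,
    "observability.slo.breach": on_slo_breach,
}


def dispatch(event_type: str, payload: dict) -> Optional[str]:
    """Run the handler for `event_type`; return None for unrecognized events."""
    handler = HANDLERS.get(event_type)
    return handler(payload) if handler else None


print(dispatch("observability.slo.breach", {"sloId": "api_availability"}))
# open incident for SLO api_availability
```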
See Also
- Audit Rail - Audit logging
- Analytics Rail - Business analytics