Observability Rail

Platform monitoring, SLO tracking, and distributed tracing.

Overview

The Observability Rail provides endpoints for monitoring platform health, tracking SLOs, querying distributed traces, collecting metrics, managing alerts, and reporting error rates.

Base URL

/api/v1/observability

Endpoints

Get Health Status

GET /api/v1/observability/health

Returns the overall platform status together with the status and latency of each component (api, database, cache, queue).

Response:

{
  "data": {
    "status": "HEALTHY",
    "timestamp": "2025-01-15T10:00:00Z",
    "components": {
      "api": { "status": "HEALTHY", "latency": 45 },
      "database": { "status": "HEALTHY", "latency": 12 },
      "cache": { "status": "HEALTHY", "latency": 2 },
      "queue": { "status": "HEALTHY", "latency": 5 }
    },
    "version": "2.0.0"
  }
}
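
As a minimal sketch of consuming this response, the snippet below lists components that are not healthy and picks the one with the highest latency. The helper names are hypothetical, and the sketch assumes that any status other than `HEALTHY` signals a problem; the other possible status values are not documented here.

```python
import json

# Sample payload copied from the response above.
HEALTH_RESPONSE = json.loads("""
{
  "data": {
    "status": "HEALTHY",
    "timestamp": "2025-01-15T10:00:00Z",
    "components": {
      "api": { "status": "HEALTHY", "latency": 45 },
      "database": { "status": "HEALTHY", "latency": 12 },
      "cache": { "status": "HEALTHY", "latency": 2 },
      "queue": { "status": "HEALTHY", "latency": 5 }
    },
    "version": "2.0.0"
  }
}
""")


def unhealthy_components(payload: dict) -> list[str]:
    """Names of components whose status is anything other than HEALTHY (assumption)."""
    components = payload["data"]["components"]
    return sorted(name for name, info in components.items() if info["status"] != "HEALTHY")


def slowest_component(payload: dict) -> str:
    """Name of the component reporting the highest latency value."""
    components = payload["data"]["components"]
    return max(components, key=lambda name: components[name]["latency"])


print(unhealthy_components(HEALTH_RESPONSE))
print(slowest_component(HEALTH_RESPONSE))
```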

Get SLO Status

GET /api/v1/observability/slos

Returns the current status of each Service Level Objective (SLO), including its target, current value, and evaluation period.

Response:

{
  "data": [
    {
      "sloId": "api_availability",
      "name": "API Availability",
      "target": 0.999,
      "current": 0.9995,
      "status": "MET",
      "period": "30d",
      "errorBudgetRemaining": 0.0005
    },
    {
      "sloId": "api_latency_p99",
      "name": "API Latency P99",
      "target": 500,
      "current": 245,
      "unit": "ms",
      "status": "MET",
      "period": "30d"
    }
  ]
}
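
The availability figures in the sample can be checked with a short calculation. This is a sketch under an assumption: the page does not define how `errorBudgetRemaining` is derived, but the sample values are consistent with "allowed failure fraction minus consumed failure fraction", which is what the hypothetical `error_budget` helper computes.

```python
def error_budget(target: float, current: float) -> dict:
    """Error budget for an availability SLO, with all values as fractions of requests.

    Assumption: remaining budget = (1 - target) - (1 - current) = current - target.
    """
    allowed = 1.0 - target     # failure fraction the SLO tolerates
    consumed = 1.0 - current   # failure fraction observed so far
    remaining = allowed - consumed
    return {
        "allowed": round(allowed, 6),
        "consumed": round(consumed, 6),
        "remaining": round(remaining, 6),
        # Share of the budget still unspent; undefined for a 100% target.
        "fractionRemaining": round(remaining / allowed, 4) if allowed > 0 else None,
    }


# Values from the api_availability entry above.
budget = error_budget(target=0.999, current=0.9995)
print(budget)
```

With the sample values, half of the 0.001 budget has been consumed, which matches the documented `errorBudgetRemaining` of 0.0005.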

Get Metrics

GET /api/v1/observability/metrics

Returns time-series values for a platform metric over the requested period, combined with the requested aggregation.

Query Parameters:

Parameter	Type	Description
`metric`	string	Metric name
`period`	string	Time period, e.g. `24h`
`aggregation`	string	Aggregation function: avg, sum, max, min, or p99

Response:

{
  "data": {
    "metric": "api_requests_total",
    "period": "24h",
    "aggregation": "sum",
    "values": [
      { "timestamp": "2025-01-14T10:00:00Z", "value": 150000 },
      { "timestamp": "2025-01-14T11:00:00Z", "value": 175000 }
    ],
    "total": 3500000
  }
}
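
A request for the sample above could be assembled as follows. This is a sketch only: `metrics_url` is a hypothetical helper, and rejecting unknown aggregations on the client side is an assumption about what is convenient, not documented server behavior.

```python
from urllib.parse import urlencode

BASE_PATH = "/api/v1/observability"
# Aggregation values listed in the query parameter table.
AGGREGATIONS = {"avg", "sum", "max", "min", "p99"}


def metrics_url(metric: str, period: str, aggregation: str) -> str:
    """Build the relative URL for a Get Metrics request."""
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation!r}")
    query = urlencode({"metric": metric, "period": period, "aggregation": aggregation})
    return f"{BASE_PATH}/metrics?{query}"


url = metrics_url("api_requests_total", "24h", "sum")
print(url)
```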

Get Traces

GET /api/v1/observability/traces

Returns a list of distributed trace summaries, optionally narrowed by the query parameters below.

Query Parameters:

Parameter	Type	Description
`service`	string	Filter by service
`operation`	string	Filter by operation
`minDuration`	number	Min duration (ms)
`status`	string	Filter by trace status: OK or ERROR
`from`	string	Start timestamp

Response:

{
  "data": [
    {
      "traceId": "trace_abc123",
      "rootSpan": "POST /api/v1/contracts",
      "service": "rail-api",
      "duration": 245,
      "status": "OK",
      "spanCount": 12,
      "timestamp": "2025-01-15T10:00:00Z"
    }
  ]
}
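
Because every trace filter is optional, a client only needs to send the ones it sets. The sketch below uses a hypothetical `traces_url` helper; note that `from` is a reserved word in Python, so the argument is named `from_ts` here. The expected timestamp format for `from` is not specified on this page.

```python
from urllib.parse import urlencode

TRACES_PATH = "/api/v1/observability/traces"


def traces_url(service=None, operation=None, min_duration=None, status=None, from_ts=None) -> str:
    """Build the relative URL for a Get Traces request, skipping unset filters."""
    params = {
        "service": service,
        "operation": operation,
        "minDuration": min_duration,  # milliseconds, per the parameter table
        "status": status,             # "OK" or "ERROR"
        "from": from_ts,
    }
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{TRACES_PATH}?{query}" if query else TRACES_PATH


print(traces_url())
print(traces_url(service="rail-api", min_duration=200, status="ERROR"))
```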

Get Trace Details

GET /api/v1/observability/traces/:traceId

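Returns a single trace with all of its spans. Each span references its parent through `parentSpanId`; in the sample below, the root span has `parentSpanId` set to null.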

Response:

{
  "data": {
    "traceId": "trace_abc123",
    "spans": [
      {
        "spanId": "span_1",
        "parentSpanId": null,
        "operation": "POST /api/v1/contracts",
        "service": "rail-api",
        "duration": 245,
        "status": "OK",
        "tags": {
          "http.method": "POST",
          "http.status_code": 201
        }
      },
      {
        "spanId": "span_2",
        "parentSpanId": "span_1",
        "operation": "db.insert",
        "service": "postgresql",
        "duration": 45,
        "status": "OK"
      }
    ]
  }
}
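
The flat `spans` array can be turned back into a call tree by grouping spans on `parentSpanId`. The following is a minimal sketch with hypothetical helper names; it assumes that root spans carry a null `parentSpanId`, as in the sample.

```python
from collections import defaultdict

# Spans copied from the response above (tags omitted for brevity).
SPANS = [
    {"spanId": "span_1", "parentSpanId": None, "operation": "POST /api/v1/contracts",
     "service": "rail-api", "duration": 245, "status": "OK"},
    {"spanId": "span_2", "parentSpanId": "span_1", "operation": "db.insert",
     "service": "postgresql", "duration": 45, "status": "OK"},
]


def children_by_parent(spans: list[dict]) -> dict:
    """Group spans by the ID of their parent (None for root spans)."""
    children = defaultdict(list)
    for span in spans:
        children[span["parentSpanId"]].append(span)
    return children


def render_tree(spans: list[dict]) -> list[str]:
    """Return one indented line per span, parents before their children."""
    children = children_by_parent(spans)
    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span['operation']} [{span['service']}] {span['duration']}")
            walk(span["spanId"], depth + 1)

    walk(None, 0)
    return lines


for line in render_tree(SPANS):
    print(line)
```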

Get Alerts

GET /api/v1/observability/alerts

Returns the currently active alerts.

Response:

{
  "data": [
    {
      "alertId": "alert_xyz",
      "name": "High Error Rate",
      "severity": "WARNING",
      "status": "FIRING",
      "message": "Error rate > 1% for contracts rail",
      "startedAt": "2025-01-15T09:45:00Z",
      "labels": {
        "rail": "contracts",
        "environment": "production"
      }
    }
  ]
}
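
A client will often want the most urgent alerts first. The sketch below orders alerts by the severities listed under Alert Severities (CRITICAL, then WARNING, then INFO) and then by `startedAt`. The second and third alerts in the sample list are made up for illustration, and `triage` is a hypothetical helper.

```python
# Lower rank sorts first; severities come from the Alert Severities table.
SEVERITY_RANK = {"CRITICAL": 0, "WARNING": 1, "INFO": 2}

ALERTS = [
    # First entry matches the sample response; the other two are invented examples.
    {"alertId": "alert_xyz", "severity": "WARNING", "startedAt": "2025-01-15T09:45:00Z"},
    {"alertId": "alert_info", "severity": "INFO", "startedAt": "2025-01-15T09:00:00Z"},
    {"alertId": "alert_crit", "severity": "CRITICAL", "startedAt": "2025-01-15T09:50:00Z"},
]


def triage(alerts: list[dict]) -> list[dict]:
    """Sort by severity, then by start time (oldest first).

    Timestamps in the same UTC ISO 8601 format compare correctly as strings.
    Unknown severities are placed last.
    """
    return sorted(
        alerts,
        key=lambda a: (SEVERITY_RANK.get(a["severity"], len(SEVERITY_RANK)), a["startedAt"]),
    )


print([alert["alertId"] for alert in triage(ALERTS)])
```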

Acknowledge Alert

POST /api/v1/observability/alerts/:alertId/acknowledge

Acknowledges the alert identified by `:alertId`.
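
As a sketch, the request could be prepared like this. The host `https://api.example.com` is a placeholder, `acknowledge_request` is a hypothetical helper, and no request body or authentication header is set because neither is documented on this page. The request object is only constructed here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

API_ROOT = "https://api.example.com"  # placeholder host, not a documented value


def acknowledge_request(alert_id: str) -> Request:
    """Prepare (but do not send) the POST request that acknowledges an alert."""
    # Escape the ID so unexpected characters cannot alter the path.
    path = f"/api/v1/observability/alerts/{quote(alert_id, safe='')}/acknowledge"
    return Request(API_ROOT + path, method="POST")


req = acknowledge_request("alert_xyz")
print(req.get_method(), req.full_url)
```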

Get Error Rates

GET /api/v1/observability/errors

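Returns error rates by rail/endpoint. The sample response below shows overall totals, the per-rail breakdown (`byRail`), and the most frequent error codes (`topErrors`).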

Response:

{
  "data": {
    "period": "1h",
    "totalRequests": 150000,
    "totalErrors": 150,
    "errorRate": 0.001,
    "byRail": {
      "contracts": { "requests": 50000, "errors": 50, "rate": 0.001 },
      "kyc": { "requests": 30000, "errors": 30, "rate": 0.001 }
    },
    "topErrors": [
      { "code": "VALIDATION_ERROR", "count": 100 },
      { "code": "NOT_FOUND", "count": 35 }
    ]
  }
}
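
The rate fields in the sample can be reproduced from the counts. This sketch assumes that a rate is simply errors divided by requests, which matches every figure shown; the helper names are hypothetical. It also shows that `topErrors` need not account for every error.

```python
# Data copied from the response above.
ERRORS = {
    "period": "1h",
    "totalRequests": 150000,
    "totalErrors": 150,
    "errorRate": 0.001,
    "byRail": {
        "contracts": {"requests": 50000, "errors": 50, "rate": 0.001},
        "kyc": {"requests": 30000, "errors": 30, "rate": 0.001},
    },
    "topErrors": [
        {"code": "VALIDATION_ERROR", "count": 100},
        {"code": "NOT_FOUND", "count": 35},
    ],
}


def error_rate(requests: int, errors: int) -> float:
    """Errors as a fraction of requests (assumed definition); 0.0 when idle."""
    return errors / requests if requests else 0.0


def unlisted_errors(data: dict) -> int:
    """Errors counted in totalErrors but not covered by the topErrors list."""
    return data["totalErrors"] - sum(entry["count"] for entry in data["topErrors"])


overall = error_rate(ERRORS["totalRequests"], ERRORS["totalErrors"])
per_rail = {rail: error_rate(s["requests"], s["errors"]) for rail, s in ERRORS["byRail"].items()}
print(overall, per_rail, unlisted_errors(ERRORS))
```

In the sample, 15 of the 150 errors fall outside the two listed codes.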

Available Metrics

Metric	Description
`api_requests_total`	Total API requests
`api_request_duration_ms`	Request duration
`api_errors_total`	Total errors
`db_connections_active`	Active DB connections
`cache_hit_ratio`	Cache hit ratio
`queue_depth`	Message queue depth

SLO Types

Type	Description
Availability	Service availability
Latency	Response time
Error Rate	Error percentage
Throughput	Request volume

Alert Severities

Severity	Description
CRITICAL	Immediate action required
WARNING	Attention needed
INFO	Informational

Events

Event	Description
`observability.alert.fired`	Alert triggered
`observability.alert.resolved`	Alert resolved
`observability.slo.breach`	SLO breached
