almost ready
148
monitoring/README.md
Normal file
@@ -0,0 +1,148 @@
# BlueLaminate observability stack (standalone, Proxmox LXC)

A self-contained Grafana **LGTM** stack (**L**oki for logs, **G**rafana for dashboards,
**T**empo for traces, and Prometheus for **M**etrics) fronted by **Grafana Alloy** as a single
OTLP ingress. It runs as native systemd services on its own Proxmox LXC, decoupled from the
app's `docker-compose.yml`. The C2 and Python workers push OpenTelemetry data to Alloy, which
fans the three signals out to the backends; Grafana ties them together.

```
C2 / workers ──OTLP(4317 grpc / 4318 http)──► Alloy ──┬─► Loki        (logs, :3100)
(other host)                                          ├─► Prometheus  (metrics, :9090, remote-write)
                                                      └─► Tempo       (traces, :4319 OTLP → store)
                                                              │
                                                       Grafana (:3000)
                                           datasources: Loki + Prometheus + Tempo
```

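Only Alloy's OTLP ports (`4317`/`4318`) and Grafana (`3000`) need to be reachable from the
LAN. Alloy is the only writer to Loki and Tempo and reaches every backend over localhost.
Loki binds to `127.0.0.1`; as shipped, Tempo (`3200`, `4319`) and Prometheus (`9090`) listen
on `0.0.0.0`, so firewall those ports or rebind them to localhost if nothing else on the LAN
should reach them.
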
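To see which listeners are actually reachable beyond loopback, a small filter over `ss` output can help. This is a minimal sketch and not part of the repository: `exposed_ports` is a hypothetical helper, and it assumes the `ss -ltnH` column layout from iproute2, with the local address in the fourth column.

```bash
#!/usr/bin/env bash
# exposed_ports (hypothetical helper): read `ss -ltnH` lines on stdin and print the
# TCP ports whose listener is NOT bound to loopback (127.x.x.x or [::1]).
exposed_ports() {
  awk '{
    addr = $4; port = $4
    sub(/.*:/, "", port)        # keep what follows the last colon: the port
    sub(/:[^:]*$/, "", addr)    # drop ":port", leaving the bind address
    if (addr !~ /^(127\.|\[::1\])/) print port
  }' | sort -un
}

# On the LXC (assumes iproute2's ss is installed):
#   ss -ltnH | exposed_ports
```

Anything it prints beyond `3000`, `4317`, and `4318` is a candidate for a firewall rule or a localhost rebind.
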
## Layout

```
monitoring/
  install.sh                        # idempotent provisioner; run as root in the LXC
  alloy/config.alloy                # OTLP receiver → batch → Loki / Prometheus / Tempo
  prometheus/prometheus.yml         # self-monitoring scrapes (app metrics arrive via remote-write)
  prometheus/prometheus.service     # systemd unit: remote-write + OTLP receivers, 15d retention
  loki/loki.yml                     # single-binary, filesystem store, 15d retention
  tempo/tempo.yml                   # OTLP on :4319, local store, metrics_generator → Prometheus
  grafana/datasources.yml           # Loki + Prometheus(default) + Tempo, correlated
  grafana/dashboards.yml            # file-based dashboard provider
  grafana/dashboards/overview.json  # starter dashboard (target health, span rates, logs)
```

## 1. Create the LXC (run on the Proxmox host)

Reference only: adjust the storage, bridge, and template names to your node. An unprivileged
Debian 13 container with ~2 vCPU / 2–4 GB RAM / 20–40 GB disk is plenty.

```bash
# Make sure a Debian 13 template is present (once):
#   pveam update && pveam available | grep debian-13
#   pveam download local debian-13-standard_*_amd64.tar.zst

pct create 910 local:vztmpl/debian-13-standard_13.0-1_amd64.tar.zst \
  --hostname grafana-lxc \
  --cores 2 --memory 4096 --swap 1024 \
  --rootfs local-lvm:32 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 1 --features nesting=0 \
  --onboot 1 --start 1

# (Optional) give it a static IP instead of dhcp, e.g.
#   --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1
```

`nesting=0` is fine: there's no Docker here, just native binaries.

## 2. Deploy the stack (inside the LXC)

```bash
pct enter 910                      # or: ssh root@<lxc-ip>
apt-get update && apt-get install -y git
git clone <this-repo-url> /opt/bluelaminate
cd /opt/bluelaminate/monitoring
bash install.sh                    # already root here; a fresh LXC may not ship sudo
```

No git on the LXC? Copy just this folder over instead:
`scp -r monitoring root@<lxc-ip>:/opt/monitoring && ssh root@<lxc-ip> 'cd /opt/monitoring && bash install.sh'`

The script adds the Grafana apt repo, installs grafana/loki/tempo/alloy, drops the Prometheus
release binary into `/opt/prometheus`, lays our configs over the packaged defaults, and
enables all five services. It prints the URLs and the OTLP endpoint when done.

## 3. Verify

```bash
systemctl is-active grafana-server loki tempo prometheus alloy   # all → active
curl -s localhost:3100/ready                                     # Loki → ready
curl -s localhost:3200/ready                                     # Tempo → ready
curl -s localhost:9090/-/ready                                   # Prometheus → Ready
```

Open Grafana at `http://<lxc-ip>:3000` (first login `admin` / `admin`; change it). The three
datasources and the **BlueLaminate → Stack Overview** dashboard are provisioned automatically.
Alloy's pipeline graph is served on port `12345`. Alloy binds that UI to `127.0.0.1` by
default, so open it from inside the LXC or through an SSH tunnel, or set
`CUSTOM_ARGS="--server.http.listen-addr=0.0.0.0:12345"` in `/etc/default/alloy` to reach it at
`http://<lxc-ip>:12345`.

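The backends can take a few seconds after a restart before `/ready` succeeds, so a one-shot `curl` right after `install.sh` may fail spuriously. The readiness checks above can be wrapped in a retry loop along these lines. This is a sketch, not part of the repository: `wait_ready` and `check_stack` are hypothetical names, and the 30 attempts with a 2 second delay are arbitrary.

```bash
#!/usr/bin/env bash
# wait_ready (hypothetical helper): run a command up to <attempts> times, sleeping
# <delay> seconds between tries; succeed as soon as the command does.
wait_ready() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# check_stack polls the three readiness endpoints from the commands above.
# curl -f turns an HTTP error status (for example 503 while starting) into a non-zero exit.
check_stack() {
  local url
  for url in localhost:3100/ready localhost:3200/ready localhost:9090/-/ready; do
    if wait_ready 30 2 curl -fsS "$url"; then
      echo "ready:     $url"
    else
      echo "NOT ready: $url" >&2
      return 1
    fi
  done
}
```

Source the snippet on the LXC and run `check_stack`; it stops at the first backend that never becomes ready.
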
### End-to-end OTLP smoke test (no app changes needed)

Send synthetic telemetry from any machine that can reach the LXC, using the OpenTelemetry
`telemetrygen` tool (`go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest`):

```bash
telemetrygen traces  --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --traces 5
telemetrygen metrics --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --duration 10s
telemetrygen logs    --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --logs 5
```

Then in Grafana **Explore**, pick **Tempo** (search recent traces), **Prometheus** (query
`gen`), and **Loki** (`{service_name=~".+"}`). Seeing data in all three confirms the full
fan-out before any app is wired up.

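The same three checks can also be run from a shell on the LXC against each backend's HTTP query API, without opening Grafana. The sketch below is an illustration under stated assumptions: `has_results` and `smoke_check` are hypothetical names, and the string match assumes the backends return compact JSON with no whitespace between tokens (use `jq` instead if it is installed).

```bash
#!/usr/bin/env bash
# has_results (hypothetical helper): read a JSON response on stdin and succeed when the
# array under <key> is present and non-empty. Plain string match; assumes compact JSON.
has_results() {
  local key="$1" body
  body="$(cat)"
  [[ "$body" == *"\"${key}\":["* && "$body" != *"\"${key}\":[]"* ]]
}

# smoke_check queries Prometheus, Loki, and Tempo on localhost (run it on the LXC).
smoke_check() {
  curl -s 'localhost:9090/api/v1/query?query=gen' | has_results result \
    && echo "Prometheus: metric 'gen' present"
  curl -sG localhost:3100/loki/api/v1/query_range \
       --data-urlencode 'query={service_name=~".+"}' | has_results result \
    && echo "Loki: log streams present"
  curl -s localhost:3200/api/search | has_results traces \
    && echo "Tempo: traces present"
}
```

A backend that prints nothing either received no data or is not reachable on localhost; the Grafana **Explore** view remains the authoritative check.
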
## 4. Wiring the apps later (the OTLP contract)

This deployment is **stack-only**; the C2 and workers aren't instrumented yet. When you
instrument them, point them at this LXC; nothing here changes. The drop-in:

**.NET C2** (`BlueLaminate.C2`): add the packages `OpenTelemetry.Extensions.Hosting`,
`OpenTelemetry.Exporter.OpenTelemetryProtocol`, and the
`OpenTelemetry.Instrumentation.AspNetCore` / `.Http` / runtime instrumentations, then call
`builder.Services.AddOpenTelemetry().WithTracing(...).WithMetrics(...)` plus
`builder.Logging.AddOpenTelemetry(...)`. Configure via env:

```
OTEL_EXPORTER_OTLP_ENDPOINT=http://<lxc-ip>:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=bluelaminate-c2
```

**Python workers** (`worker/csmoney_worker.py`, `skinland_worker.py`): add
`opentelemetry-distro` and `opentelemetry-exporter-otlp` to `worker/requirements.txt`, run
each under `opentelemetry-instrument python csmoney_worker.py`, and set the same env vars with
`OTEL_SERVICE_NAME=csmoney-worker` / `skinland-worker`. (Today the workers emit structured
JSON logs to stdout with `LOG_JSON=1`, set by default in the image. An interim option,
instead of instrumenting in-process, is to ship their Docker stdout to Loki with an Alloy
`loki.source.docker` component on the app host, paired with a `loki.process` JSON stage to
parse those fields.)

Add those env vars to the matching `docker-compose.yml` services when the instrumentation lands.

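To keep the three variables consistent across services, the contract above can be generated rather than retyped. This is a small sketch under assumptions: `otel_env` is a hypothetical helper, `192.168.1.50` is a placeholder address, and `worker/otel.env` is an illustrative file name that does not exist in the repository.

```bash
#!/usr/bin/env bash
# otel_env (hypothetical helper): print the OTLP contract as KEY=value lines for one service.
otel_env() {
  local lxc_ip="$1" service="$2"
  printf 'OTEL_EXPORTER_OTLP_ENDPOINT=http://%s:4318\n' "$lxc_ip"
  printf 'OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf\n'
  printf 'OTEL_SERVICE_NAME=%s\n' "$service"
}

# Examples (placeholder address and file name):
#   otel_env 192.168.1.50 csmoney-worker > worker/otel.env
#   env $(otel_env 192.168.1.50 csmoney-worker) opentelemetry-instrument python csmoney_worker.py
```

The generated file can then be referenced from a Compose service through its `env_file` key instead of listing each variable inline.
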
## Hardening

- **Firewall the OTLP ports.** `4317`/`4318` are bound to `0.0.0.0`. Restrict them to the app
  host, e.g. `ufw allow from <app-host-ip> to any port 4317,4318 proto tcp`.
- **Auth on ingest (optional).** Add an `otelcol.auth.bearer` handler to
  `otelcol.receiver.otlp` in `alloy/config.alloy` and send a matching
  `OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer <token>` from the apps.
- **Grafana password.** Change `admin` on first login, or set it before the first start via
  `admin_password` under `[security]` in `/etc/grafana/grafana.ini` (equivalently, the
  `GF_SECURITY_ADMIN_PASSWORD` environment variable, e.g. in `/etc/default/grafana-server`).

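The firewall bullet can be turned into a reviewable dry run before anything is applied. The sketch below only prints commands. It assumes `ufw` has been installed separately (`install.sh` does not install it); `ufw_rules` is a hypothetical helper and the addresses are placeholders.

```bash
#!/usr/bin/env bash
# ufw_rules (hypothetical helper): print, without running, the ufw commands that limit
# OTLP ingest to one app host and Grafana to one admin network.
ufw_rules() {
  local app_host="$1" admin_net="$2"
  echo "ufw allow from ${app_host} to any port 4317,4318 proto tcp"
  echo "ufw allow from ${admin_net} to any port 3000 proto tcp"
}

# Review the output first, then pipe it to a shell as root to apply it:
#   ufw_rules 192.168.1.20 192.168.1.0/24          # dry run (placeholder addresses)
#   ufw_rules 192.168.1.20 192.168.1.0/24 | bash
#   ufw enable
```

If you manage the LXC over SSH rather than `pct enter`, allow SSH before running `ufw enable`, since ufw's stock incoming policy is deny.
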
## Retention / sizing

Defaults are LXC-friendly: Prometheus **15d**, Loki **15d**, Tempo **7d**. If you have the
disk, raise the `--storage.tsdb.retention.time` flag (`prometheus.service`),
`limits_config.retention_period` (`loki.yml`), and `compactor.compaction.block_retention`
(`tempo.yml`). Re-run `install.sh` to apply config edits.

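The shipped configs express the Prometheus retention in days (`15d`) but the Loki and Tempo retention in hours (`360h`, `168h`). A tiny converter keeps the three values aligned when you change them; `days_to_hours` is a hypothetical helper, not part of the repository.

```bash
#!/usr/bin/env bash
# days_to_hours (hypothetical helper): convert whole days to the "<n>h" form used by
# retention_period in loki.yml and block_retention in tempo.yml.
days_to_hours() {
  local days="$1"
  if [[ ! "$days" =~ ^[0-9]+$ ]]; then
    echo "days_to_hours: expected a whole number of days, got '${days}'" >&2
    return 1
  fi
  # 10# forces base 10 so a value such as 08 is not parsed as octal.
  echo "$((10#$days * 24))h"
}

days_to_hours 15   # prints 360h, the shipped Loki retention
days_to_hours 7    # prints 168h, the shipped Tempo retention
```
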
67
monitoring/alloy/config.alloy
Normal file
@@ -0,0 +1,67 @@
// Grafana Alloy: the single OTLP ingress for the BlueLaminate fleet.
//
// Receives OTLP (gRPC :4317 / HTTP :4318) from the C2 and the Python workers, batches it,
// then fans the three signals out to the local backends:
//   metrics -> Prometheus (remote-write)
//   logs    -> Loki       (push API)
//   traces  -> Tempo      (OTLP gRPC on :4319, a non-colliding port)
//
// OTLP is bound on 0.0.0.0 so apps on other LAN hosts can push to this LXC. Alloy reaches
// every backend over localhost and is the only writer to Loki and Tempo. Loki binds to
// 127.0.0.1; Tempo and Prometheus listen on 0.0.0.0 as shipped (see each backend's config).
// See README "Hardening" to add a bearer token.

otelcol.receiver.otlp "in" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.to_prom.input]
    logs    = [otelcol.exporter.loki.to_loki.input]
    traces  = [otelcol.exporter.otlp.to_tempo.input]
  }
}

// --- metrics -> Prometheus remote-write ---------------------------------------------------
otelcol.exporter.prometheus "to_prom" {
  forward_to = [prometheus.remote_write.local.receiver]
}

prometheus.remote_write "local" {
  endpoint {
    url = "http://localhost:9090/api/v1/write"
  }
}

// --- logs -> Loki push --------------------------------------------------------------------
otelcol.exporter.loki "to_loki" {
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"
  }
}

// --- traces -> Tempo ----------------------------------------------------------------------
// Tempo's own OTLP receiver listens on :4319 so it doesn't collide with this Alloy receiver
// on :4317/:4318. TLS is off because this is a localhost hop.
otelcol.exporter.otlp "to_tempo" {
  client {
    endpoint = "localhost:4319"
    tls {
      insecure = true
    }
  }
}
15
monitoring/grafana/dashboards.yml
Normal file
@@ -0,0 +1,15 @@
# Grafana dashboard provider: loads JSON dashboards from /var/lib/grafana/dashboards.
# Copied to /etc/grafana/provisioning/dashboards/ by install.sh.
apiVersion: 1

providers:
  - name: BlueLaminate
    orgId: 1
    folder: BlueLaminate
    type: file
    disableDeletion: false
    allowUiUpdates: true
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
109
monitoring/grafana/dashboards/overview.json
Normal file
@@ -0,0 +1,109 @@
{
  "annotations": { "list": [] },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "links": [],
  "panels": [
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "mappings": [
            { "type": "value", "options": { "0": { "text": "DOWN", "color": "red" }, "1": { "text": "UP", "color": "green" } } }
          ],
          "thresholds": { "mode": "absolute", "steps": [ { "color": "red", "value": null }, { "color": "green", "value": 1 } ] }
        },
        "overrides": []
      },
      "gridPos": { "h": 6, "w": 24, "x": 0, "y": 0 },
      "id": 1,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "value_and_name"
      },
      "pluginVersion": "11.0.0",
      "targets": [
        { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "up", "legendFormat": "{{job}}", "refId": "A" }
      ],
      "title": "Stack targets — up/down",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": { "custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 }, "unit": "reqps" },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 },
      "id": 2,
      "options": { "legend": { "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "multi", "sort": "desc" } },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum by (service_name) (rate(traces_spanmetrics_calls_total[5m]))",
          "legendFormat": "{{service_name}}",
          "refId": "A"
        }
      ],
      "title": "Span call rate by service (Tempo span-metrics)",
      "type": "timeseries"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": { "custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 1 }, "unit": "bytes" },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 },
      "id": 3,
      "options": { "legend": { "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "multi", "sort": "desc" } },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "process_resident_memory_bytes",
          "legendFormat": "{{job}}",
          "refId": "A"
        }
      ],
      "title": "Stack process memory",
      "type": "timeseries"
    },
    {
      "datasource": { "type": "loki", "uid": "loki" },
      "gridPos": { "h": 10, "w": 24, "x": 0, "y": 14 },
      "id": 4,
      "options": {
        "dedupStrategy": "none",
        "enableLogDetails": true,
        "showTime": true,
        "sortOrder": "Descending",
        "wrapLogMessage": true
      },
      "targets": [
        {
          "datasource": { "type": "loki", "uid": "loki" },
          "expr": "{service_name=~\".+\"}",
          "refId": "A"
        }
      ],
      "title": "Recent logs (all services)",
      "type": "logs"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 39,
  "tags": ["bluelaminate"],
  "templating": { "list": [] },
  "time": { "from": "now-6h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "BlueLaminate — Stack Overview",
  "uid": "bl-overview",
  "version": 1,
  "weekStart": ""
}
53
monitoring/grafana/datasources.yml
Normal file
@@ -0,0 +1,53 @@
# Grafana datasource provisioning: Prometheus (default), Loki, Tempo, wired for
# trace <-> log <-> metric correlation. Copied to
# /etc/grafana/provisioning/datasources/ by install.sh.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    jsonData:
      httpMethod: POST

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://localhost:3100
    jsonData:
      # Turn a trace_id found on a log line into a clickable jump to the trace in Tempo.
      # OTLP logs carry the id as structured metadata `trace_id`; adjust the regex if your
      # app instrumentation emits it differently.
      derivedFields:
        - name: TraceID
          matcherType: label
          matcherRegex: trace_id
          datasourceUid: tempo
          # `$$` escapes Grafana's provisioning-time variable expansion, so the literal
          # ${__value.raw} reaches the datasource.
          url: "$${__value.raw}"
          urlDisplayLabel: "View trace"

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://localhost:3200
    jsonData:
      # Span -> related logs in Loki.
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1h"
        spanEndTimeShift: "1h"
        filterByTraceID: true
        filterBySpanID: false
      # Span -> RED metrics in Prometheus (from Tempo's metrics_generator).
      tracesToMetrics:
        datasourceUid: prometheus
      # Service graph + node graph from the generator's service-graph metrics.
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
122
monitoring/install.sh
Normal file
@@ -0,0 +1,122 @@
#!/usr/bin/env bash
#
# Provision the standalone BlueLaminate observability stack on a fresh Debian LXC:
#   Grafana + Loki + Tempo + Alloy   (Grafana apt repo, each with its own systemd unit)
#   Prometheus                       (official release tarball -> /opt/prometheus + our unit)
#
# Idempotent: safe to re-run (re-applies configs and restarts services). Run as root.
# The file is committed without the executable bit, so invoke it through bash:
#
#   bash install.sh
#
# Override the Prometheus version with PROM_VERSION=x.y.z bash install.sh if needed.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

if [[ "${EUID}" -ne 0 ]]; then
  echo "ERROR: run as root (bash install.sh)." >&2
  exit 1
fi

ARCH="$(dpkg --print-architecture)"   # amd64 / arm64
echo "==> Target architecture: ${ARCH}"

# --- prerequisites ------------------------------------------------------------------------
echo "==> Installing prerequisites"
export DEBIAN_FRONTEND=noninteractive
apt-get update -y
# ca-certificates covers the HTTPS downloads below. apt handles HTTPS natively, and
# add-apt-repository is never called, so apt-transport-https and software-properties-common
# (which may be unavailable on newer Debian releases) are not needed.
apt-get install -y ca-certificates gpg wget curl tar

# --- Grafana apt repo: grafana, loki, tempo, alloy ----------------------------------------
echo "==> Adding the Grafana apt repository"
mkdir -p /etc/apt/keyrings
if [[ ! -s /etc/apt/keyrings/grafana.asc ]]; then
  wget -qO /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
fi
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" \
  > /etc/apt/sources.list.d/grafana.list
apt-get update -y

echo "==> Installing Grafana, Loki, Tempo, Alloy"
apt-get install -y grafana loki tempo alloy

# --- Prometheus (release tarball) ---------------------------------------------------------
echo "==> Installing Prometheus"
# Resolve the latest release tag from GitHub; fall back to a pinned version if that fails.
PROM_VERSION="${PROM_VERSION:-$(curl -fsSL https://api.github.com/repos/prometheus/prometheus/releases/latest \
  | grep -oP '"tag_name":\s*"v\K[^"]+' || true)}"
PROM_VERSION="${PROM_VERSION:-3.2.1}"
echo "    Prometheus version: ${PROM_VERSION}"

id -u prometheus &>/dev/null || useradd --system --no-create-home --shell /usr/sbin/nologin prometheus

PROM_DIR="prometheus-${PROM_VERSION}.linux-${ARCH}"
TMP="$(mktemp -d)"
trap 'rm -rf "${TMP}"' EXIT
wget -qO "${TMP}/prom.tar.gz" \
  "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/${PROM_DIR}.tar.gz"
tar -xzf "${TMP}/prom.tar.gz" -C "${TMP}"
install -d /opt/prometheus
install -m 0755 "${TMP}/${PROM_DIR}/prometheus" /opt/prometheus/prometheus
install -m 0755 "${TMP}/${PROM_DIR}/promtool" /opt/prometheus/promtool

# --- data directories ---------------------------------------------------------------------
echo "==> Creating data directories"
install -d -o prometheus -g prometheus /var/lib/prometheus
install -d -o loki -g loki /var/lib/loki /var/lib/loki/chunks /var/lib/loki/rules /var/lib/loki/compactor
install -d -o tempo -g tempo /var/lib/tempo /var/lib/tempo/wal /var/lib/tempo/blocks \
  /var/lib/tempo/generator/wal /var/lib/tempo/generator/traces

# --- configuration ------------------------------------------------------------------------
echo "==> Installing configuration files"
install -d /etc/alloy /etc/loki /etc/tempo /etc/prometheus
install -m 0644 "${SCRIPT_DIR}/alloy/config.alloy" /etc/alloy/config.alloy
install -m 0644 "${SCRIPT_DIR}/loki/loki.yml" /etc/loki/config.yml
install -m 0644 "${SCRIPT_DIR}/tempo/tempo.yml" /etc/tempo/config.yml
install -m 0644 "${SCRIPT_DIR}/prometheus/prometheus.yml" /etc/prometheus/prometheus.yml
install -m 0644 "${SCRIPT_DIR}/prometheus/prometheus.service" /etc/systemd/system/prometheus.service

# Point Alloy's systemd unit at our config (the package reads /etc/default/alloy).
cat > /etc/default/alloy <<'EOF'
CONFIG_FILE="/etc/alloy/config.alloy"
CUSTOM_ARGS=""
RESTART_ON_UPGRADE=true
EOF

# Grafana provisioning (datasources + dashboards).
echo "==> Installing Grafana provisioning"
install -d /etc/grafana/provisioning/datasources \
  /etc/grafana/provisioning/dashboards \
  /var/lib/grafana/dashboards
install -m 0644 "${SCRIPT_DIR}/grafana/datasources.yml" /etc/grafana/provisioning/datasources/bluelaminate.yml
install -m 0644 "${SCRIPT_DIR}/grafana/dashboards.yml" /etc/grafana/provisioning/dashboards/bluelaminate.yml
install -m 0644 "${SCRIPT_DIR}"/grafana/dashboards/*.json /var/lib/grafana/dashboards/
chown -R grafana:grafana /var/lib/grafana/dashboards 2>/dev/null || true

# --- start everything ---------------------------------------------------------------------
echo "==> Enabling + starting services"
systemctl daemon-reload
systemctl enable --now grafana-server loki tempo prometheus alloy
# Restart so a re-run picks up edited configs on services that were already running.
systemctl restart loki tempo prometheus alloy grafana-server

# --- summary ------------------------------------------------------------------------------
IP="$(hostname -I 2>/dev/null | awk '{print $1}')"
cat <<EOF

============================================================================
BlueLaminate observability stack installed.

  Grafana UI     : http://${IP:-<lxc-ip>}:3000 (first login admin/admin)
  OTLP ingress   : ${IP:-<lxc-ip>}:4317 (gRPC) / ${IP:-<lxc-ip>}:4318 (HTTP)
  Alloy debug UI : http://localhost:12345 (loopback only unless CUSTOM_ARGS sets --server.http.listen-addr)
  Prometheus     : http://${IP:-<lxc-ip>}:9090

  Point apps at: OTEL_EXPORTER_OTLP_ENDPOINT=http://${IP:-<lxc-ip>}:4318

  Readiness checks:
    systemctl is-active grafana-server loki tempo prometheus alloy
    curl -s localhost:3100/ready      # Loki
    curl -s localhost:3200/ready      # Tempo
    curl -s localhost:9090/-/ready    # Prometheus
============================================================================
EOF
59
monitoring/loki/loki.yml
Normal file
@@ -0,0 +1,59 @@
# Loki: single-binary, filesystem-backed, no auth (localhost-only; Alloy is the only writer).
# Tuned for an LXC: TSDB index, 15-day retention with the compactor enforcing deletes.
auth_enabled: false

server:
  http_listen_address: 127.0.0.1
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 360h   # 15 days
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  # Required so OTLP resource/scope attributes (and trace_id/span_id) land as structured metadata.
  allow_structured_metadata: true
  volume_enabled: true

compactor:
  working_directory: /var/lib/loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

ruler:
  storage:
    type: local
    local:
      directory: /var/lib/loki/rules
25
monitoring/prometheus/prometheus.service
Normal file
@@ -0,0 +1,25 @@
# Prometheus is not in the Grafana apt repo, so install.sh drops the release binary into
# /opt/prometheus and installs this unit. Flags: remote-write + OTLP receivers ON (Alloy and
# Tempo push to it), 15-day local retention.
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5
ExecStart=/opt/prometheus/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --web.enable-remote-write-receiver \
  --web.enable-otlp-receiver \
  --web.listen-address=0.0.0.0:9090

[Install]
WantedBy=multi-user.target
32
monitoring/prometheus/prometheus.yml
Normal file
@@ -0,0 +1,32 @@
# Prometheus for the BlueLaminate observability LXC.
#
# App + Tempo metrics arrive via REMOTE-WRITE (Alloy and Tempo's metrics_generator push to
# /api/v1/write, enabled by the --web.enable-remote-write-receiver flag in prometheus.service),
# so they need no scrape config. The scrape jobs below are just the stack's own self-monitoring.

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    monitor: bluelaminate-lxc

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: alloy
    static_configs:
      - targets: ["localhost:12345"]

  - job_name: loki
    static_configs:
      - targets: ["localhost:3100"]

  - job_name: tempo
    static_configs:
      - targets: ["localhost:3200"]

  - job_name: grafana
    static_configs:
      - targets: ["localhost:3000"]
48
monitoring/tempo/tempo.yml
Normal file
@@ -0,0 +1,48 @@
# Tempo: local-disk trace store. Receives OTLP from Alloy on :4319 (Alloy owns :4317/:4318),
# and runs the metrics_generator to emit RED + service-graph metrics, remote-written into
# Prometheus so Grafana can draw request rates and the service map without any app metrics.
server:
  http_listen_address: 0.0.0.0
  http_listen_port: 3200
  grpc_listen_port: 9095
  log_level: info

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4319"

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h   # 7 days of traces

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/lib/tempo/generator/wal
    remote_write:
      - url: http://localhost:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/lib/tempo/generator/traces

storage:
  trace:
    backend: local
    wal:
      path: /var/lib/tempo/wal
    local:
      path: /var/lib/tempo/blocks

# Turn the generator on for every tenant (single-tenant here).
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]