Operation-Blue-Laminate-v2/monitoring/README.md
2026-06-01 10:52:06 -05:00

# BlueLaminate observability stack (standalone, Proxmox LXC)
A self-contained Grafana **LGTM** stack — **L**oki (logs), **G**rafana (dashboards),
**T**empo (traces), and Prometheus (**M**etrics) — fronted by **Grafana Alloy** as a single
OTLP ingress. It runs as native systemd services on its own Proxmox LXC, decoupled from the
app's `docker-compose.yml`. The C2 and Python workers push OpenTelemetry data to Alloy, which
fans the three signals out to the backends; Grafana ties them together.
```
C2 / workers ──OTLP(4317 grpc / 4318 http)──► Alloy ──┬─► Loki       (logs,    :3100)
(other host)                                          ├─► Prometheus (metrics, :9090, remote-write)
                                                      └─► Tempo      (traces,  :4319 OTLP → store)
                                              Grafana (:3000)
                                              datasources: Loki + Prometheus + Tempo
```
Only Alloy's OTLP ports (`4317`/`4318`) and Grafana (`3000`) need to be reachable from the
LAN. Loki and Tempo bind to localhost; the only clients that talk to them are Alloy (writes)
and Grafana (queries), both on this host.
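One way to check the exposure described above from inside the LXC is to list listening TCP sockets and keep only those not bound to loopback. This is a minimal sketch, assuming `ss` (iproute2) and `awk` are installed; the function name `lan_exposed_ports` is illustrative and not part of this repo.

```bash
# Hypothetical helper (not shipped in this repo): reads `ss -tlnH` output on
# stdin and prints the TCP ports that listen on a non-loopback address.
lan_exposed_ports() {
  awk '$4 !~ /^(127\.|\[::1\])/ {   # column 4 is "Local Address:Port"
         n = split($4, parts, ":")  # the port is the last colon-separated field
         print parts[n]
       }' | sort -un
}

# Usage inside the LXC:
#   ss -tlnH | lan_exposed_ports
```

Any port in the output beyond the ones you intend to expose is worth a look in the matching service config.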
## Layout
```
monitoring/
  install.sh                        # idempotent provisioner; run as root in the LXC
  alloy/config.alloy                # OTLP receiver → batch → Loki / Prometheus / Tempo
  prometheus/prometheus.yml         # self-monitoring scrapes (app metrics arrive via remote-write)
  prometheus/prometheus.service     # systemd unit: remote-write + OTLP receivers, 15d retention
  loki/loki.yml                     # single-binary, filesystem store, 15d retention
  tempo/tempo.yml                   # OTLP on :4319, local store, metrics_generator → Prometheus
  grafana/datasources.yml           # Loki + Prometheus(default) + Tempo, correlated
  grafana/dashboards.yml            # file-based dashboard provider
  grafana/dashboards/overview.json  # starter dashboard (target health, span rates, logs)
```
## 1. Create the LXC (run on the Proxmox host)
Reference only: adjust the storage, bridge, and template names to your node. An unprivileged
Debian 13 container with ~2 vCPU / 2-4 GB RAM / 20-40 GB disk is plenty.
```bash
# Make sure a Debian 13 template is present (once):
# pveam update && pveam available | grep debian-13
# pveam download local debian-13-standard_*_amd64.tar.zst
pct create 910 local:vztmpl/debian-13-standard_13.0-1_amd64.tar.zst \
  --hostname grafana-lxc \
  --cores 2 --memory 4096 --swap 1024 \
  --rootfs local-lvm:32 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 1 --features nesting=0 \
  --onboot 1 --start 1
# (Optional) give it a static IP instead of dhcp, e.g.
# --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1
```
`nesting=0` is fine — there's no Docker here, just native binaries.
## 2. Deploy the stack (inside the LXC)
```bash
pct enter 910 # or: ssh root@<lxc-ip>
apt-get update && apt-get install -y git
git clone <this-repo-url> /opt/bluelaminate
cd /opt/bluelaminate/monitoring
bash install.sh # already root here, so no sudo needed
```
No git on the LXC? Copy just this folder over instead:
`scp -r monitoring root@<lxc-ip>:/opt/monitoring && ssh root@<lxc-ip> 'cd /opt/monitoring && bash install.sh'`
The script adds the Grafana apt repo, installs grafana/loki/tempo/alloy, drops the Prometheus
release binary into `/opt/prometheus`, lays our configs over the packaged defaults, and
enables all five services. It prints the URLs and the OTLP endpoint when done.
## 3. Verify
```bash
systemctl is-active grafana-server loki tempo prometheus alloy # all → active
curl -s localhost:3100/ready # Loki → ready
curl -s localhost:3200/ready # Tempo → ready
curl -s localhost:9090/-/ready # Prometheus → Ready
```
Open Grafana at `http://<lxc-ip>:3000` (first login `admin` / `admin` — change it). The three
datasources and the **BlueLaminate → Stack Overview** dashboard are provisioned automatically.
Alloy's pipeline graph is at `http://<lxc-ip>:12345`.
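Right after `install.sh` finishes, the readiness endpoints may not answer successfully yet, so a single `curl` can give a false negative. A small retry loop is one way around that. This is a sketch, assuming `curl` is installed; `wait_ready` is a hypothetical helper name, and the 30-attempt default is arbitrary.

```bash
# Hypothetical helper: poll a readiness URL until it answers with a
# successful HTTP status, or give up after a number of one-second retries.
wait_ready() {
  local name="$1" url="$2" tries="${3:-30}"
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f: treat HTTP >= 400 as failure, -s: silent, --max-time: bound each try
    if curl -fs --max-time 2 -o /dev/null "$url"; then
      echo "$name: ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "$name: not ready after $tries attempts" >&2
  return 1
}

# Usage inside the LXC (endpoints taken from the checks above):
#   wait_ready loki       http://localhost:3100/ready
#   wait_ready tempo      http://localhost:3200/ready
#   wait_ready prometheus http://localhost:9090/-/ready
```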
### End-to-end OTLP smoke test (no app changes needed)
Send synthetic telemetry from any machine that can reach the LXC, using the OpenTelemetry
`telemetrygen` tool (`go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest`):
```bash
telemetrygen traces --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --traces 5
telemetrygen metrics --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --duration 10s
telemetrygen logs --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --logs 5
```
Then in Grafana **Explore**: pick **Tempo** (search recent traces), **Prometheus** (query
`gen`), and **Loki** (`{service_name=~".+"}`) — seeing data in all three confirms the full
fan-out before any app is wired up.
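The same check can be scripted against the backends' HTTP query APIs instead of clicking through Explore. The sketch below assumes it runs inside the LXC (Loki and Tempo listen on localhost only) and that `curl` and `grep` are available; `prom_has_series` is a hypothetical helper, not something this repo ships.

```bash
# Hypothetical helper: succeeds if a Prometheus instant-query response read
# from stdin contains at least one series. It pattern-matches the compact JSON
# the HTTP API is assumed to return; prefer `jq` beyond a smoke test.
prom_has_series() {
  grep -q '"result":\[{'
}

# Usage inside the LXC, after running telemetrygen:
#   curl -s 'http://localhost:9090/api/v1/query?query=gen' | prom_has_series \
#     && echo "metrics: ok"
#
# Loki and Tempo expose HTTP query APIs as well; inspect their JSON by eye:
#   curl -s -G http://localhost:3100/loki/api/v1/query_range \
#     --data-urlencode 'query={service_name=~".+"}'
#   curl -s 'http://localhost:3200/api/search?limit=5'
```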
## 4. Wiring the apps later (the OTLP contract)
This deployment is **stack-only**; the C2 and workers aren't instrumented yet. When you do,
point them at this LXC — nothing here changes. The drop-in:
**.NET C2** (`BlueLaminate.C2`) — add packages `OpenTelemetry.Extensions.Hosting`,
`OpenTelemetry.Exporter.OpenTelemetryProtocol`, and the
`OpenTelemetry.Instrumentation.AspNetCore` / `.Http` / runtime instrumentations, then
`builder.Services.AddOpenTelemetry().WithTracing(...).WithMetrics(...)` plus
`builder.Logging.AddOpenTelemetry(...)`. Configure via env:
```
OTEL_EXPORTER_OTLP_ENDPOINT=http://<lxc-ip>:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=bluelaminate-c2
```
**Python workers** (`worker/csmoney_worker.py`, `skinland_worker.py`) — add
`opentelemetry-distro` and `opentelemetry-exporter-otlp` to `worker/requirements.txt`, run
under `opentelemetry-instrument python csmoney_worker.py`, same env vars with
`OTEL_SERVICE_NAME=csmoney-worker` / `skinland-worker`. (Today the workers emit structured
JSON logs to stdout — `LOG_JSON=1`, set by default in the image; an interim option is to
ship their Docker stdout to Loki with an Alloy `loki.source.docker` component on the app
host, which can parse those JSON fields directly, instead of instrumenting in-process.)
Add those env vars to the matching `docker-compose.yml` services when the instrumentation lands.
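For a quick trial outside Compose, the contract above can be generated per service from the shell. This is a sketch only: `otel_env` is a hypothetical helper, and `192.0.2.10` in the usage stands in for your `<lxc-ip>`.

```bash
# Hypothetical helper: print the OTLP env vars from the contract above for
# one service. The host argument is whatever address the LXC has on your LAN.
otel_env() {
  local host="$1" service="$2"
  printf 'OTEL_EXPORTER_OTLP_ENDPOINT=http://%s:4318\n' "$host"
  printf 'OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf\n'
  printf 'OTEL_SERVICE_NAME=%s\n' "$service"
}

# Usage: export the variables, then start a worker under auto-instrumentation.
#   export $(otel_env 192.0.2.10 csmoney-worker)
#   opentelemetry-instrument python csmoney_worker.py
```

The three printed lines are the same key/value pairs that would go under a service's `environment:` key in `docker-compose.yml`.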
## Hardening
- **Firewall the OTLP ports.** `4317`/`4318` are bound to `0.0.0.0`. Restrict them to the app
host, e.g. `ufw allow from <app-host-ip> to any port 4317,4318 proto tcp`.
- **Auth on ingest (optional).** Add an `otelcol.auth.bearer` handler to
`otelcol.receiver.otlp` in `alloy/config.alloy` and send a matching
`OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer <token>` from the apps.
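- **Grafana password.** Change `admin` on first login, or set `admin_password` under
`[security]` in `/etc/grafana/grafana.ini` before the first start (the equivalent environment
variable is `GF_SECURITY_ADMIN_PASSWORD`).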
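With more than one app host, the firewall rule from the first bullet can be generated and reviewed before it is applied. This is a sketch, assuming `ufw` is the firewall in use inside the LXC; `otlp_ufw_rules` and the example addresses are hypothetical.

```bash
# Hypothetical helper: print (not run) one ufw rule per app host so the
# rules can be reviewed before they are applied.
otlp_ufw_rules() {
  local ip
  for ip in "$@"; do
    echo "ufw allow from $ip to any port 4317,4318 proto tcp"
  done
}

# Usage: review the output, then pipe it to a root shell to apply.
#   otlp_ufw_rules 192.168.1.20 192.168.1.21
#   otlp_ufw_rules 192.168.1.20 192.168.1.21 | sh
```

These allow rules only restrict access if ufw's default incoming policy is deny. Allow SSH and Grafana (`3000`) before running `ufw enable`, or you may lock yourself out.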
## Retention / sizing
Defaults are LXC-friendly: Prometheus **15d**, Loki **15d**, Tempo **7d**. Bump the
`--storage.tsdb.retention.time` flag (`prometheus.service`), `limits_config.retention_period`
(`loki.yml`), and `compactor.compaction.block_retention` (`tempo.yml`) if you have the disk.
Re-run `install.sh` to apply config edits.
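Before raising the Prometheus retention, a back-of-the-envelope disk estimate can help. The sketch below uses the rule of thumb from the Prometheus storage documentation (retention in seconds × ingested samples per second × bytes per sample, roughly 1 to 2 bytes). The ingest rate in the example is an assumed figure, not a measurement of this stack, and `prom_disk_bytes` is a hypothetical helper.

```bash
# Hypothetical helper: rough Prometheus TSDB size in bytes, using
#   retention_seconds * samples_per_second * bytes_per_sample
prom_disk_bytes() {
  local days="$1" samples_per_sec="$2" bytes_per_sample="${3:-2}"
  echo $((days * 86400 * samples_per_sec * bytes_per_sample))
}

# Example: 15 days at an assumed 1000 samples/s and 2 bytes/sample.
prom_disk_bytes 15 1000 2   # 2592000000 bytes, roughly 2.6 GB
```

To replace the assumed rate with a measured one, query `rate(prometheus_tsdb_head_samples_appended_total[5m])` in Prometheus. Loki and Tempo usage depends on log and span volume, which this formula does not cover.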