almost ready

bob
2026-06-01 10:52:06 -05:00
parent 8b0eb0db78
commit 763305ca89
94 changed files with 8766 additions and 2674 deletions

monitoring/README.md

@@ -0,0 +1,148 @@
# BlueLaminate observability stack (standalone, Proxmox LXC)
A self-contained Grafana **LGTM** stack — **L**oki (logs), **G**rafana (dashboards),
**T**empo (traces), and Prometheus (**M**etrics) — fronted by **Grafana Alloy** as a single
OTLP ingress. It runs as native systemd services on its own Proxmox LXC, decoupled from the
app's `docker-compose.yml`. The C2 and Python workers push OpenTelemetry data to Alloy, which
fans the three signals out to the backends; Grafana ties them together.
```
C2 / workers ──OTLP(4317 grpc / 4318 http)──► Alloy ──┬─► Loki (logs, :3100)
(other host) ├─► Prometheus (metrics, :9090, remote-write)
└─► Tempo (traces, :4319 OTLP → store)
Grafana (:3000)
datasources: Loki + Prometheus + Tempo
```
Only Alloy's OTLP ports (`4317`/`4318`) and Grafana (`3000`) need to be reachable from the
LAN. Loki and Tempo bind to localhost; Alloy (writes) and Grafana (queries) reach them over loopback.
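One way to check the binding claim above is to list the TCP listeners that are not on loopback. This is a minimal sketch, assuming `ss` (iproute2) is available in the container; `public_listeners` is a hypothetical helper name, not something `install.sh` provides.

```bash
# public_listeners: read `ss -Htln` output on stdin and print the ports
# that listen on a non-loopback address, one per line, sorted.
# (Hypothetical helper for illustration only.)
public_listeners() {
  awk '{ print $4 }' |                                   # column 4 = Local Address:Port
    grep -Ev '^(127\.[0-9.]+(%[a-z0-9]+)?|\[::1\]):' |   # drop loopback binds
    sed -E 's/.*:([0-9]+)$/\1/' |                        # keep only the port number
    sort -nu
}
```

On the LXC, run `ss -Htln | public_listeners`. Ports `4317`, `4318`, and `3000` should be listed, while Loki's `3100` and Tempo's `3200`/`4319` should not. Other ports (for example Alloy's UI on `12345`) may also show up depending on how they are bound.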
## Layout
```
monitoring/
install.sh # idempotent provisioner — run as root in the LXC
alloy/config.alloy # OTLP receiver → batch → Loki / Prometheus / Tempo
prometheus/prometheus.yml # self-monitoring scrapes (app metrics arrive via remote-write)
prometheus/prometheus.service # systemd unit: remote-write + OTLP receivers, 15d retention
loki/loki.yml # single-binary, filesystem store, 15d retention
tempo/tempo.yml # OTLP on :4319, local store, metrics_generator → Prometheus
grafana/datasources.yml # Loki + Prometheus(default) + Tempo, correlated
grafana/dashboards.yml # file-based dashboard provider
grafana/dashboards/overview.json # starter dashboard (target health, span rates, logs)
```
## 1. Create the LXC (run on the Proxmox host)
Reference only — adjust the storage, bridge, and template names to your node. An unprivileged
Debian 13 container with ~2 vCPU / 2-4 GB RAM / 20-40 GB disk is plenty.
```bash
# Make sure a Debian 13 template is present (once):
# pveam update && pveam available | grep debian-13
# pveam download local debian-13-standard_*_amd64.tar.zst
pct create 910 local:vztmpl/debian-13-standard_13.0-1_amd64.tar.zst \
--hostname grafana-lxc \
--cores 2 --memory 4096 --swap 1024 \
--rootfs local-lvm:32 \
--net0 name=eth0,bridge=vmbr0,ip=dhcp \
--unprivileged 1 --features nesting=0 \
--onboot 1 --start 1
# (Optional) give it a static IP instead of dhcp, e.g.
# --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1
```
`nesting=0` is fine — there's no Docker here, just native binaries.
## 2. Deploy the stack (inside the LXC)
```bash
pct enter 910 # or: ssh root@<lxc-ip>
apt-get update && apt-get install -y git
git clone <this-repo-url> /opt/bluelaminate
cd /opt/bluelaminate/monitoring
bash install.sh                # already root via pct enter / ssh root@; no sudo needed
```
No git on the LXC? Copy just this folder over instead:
`scp -r monitoring root@<lxc-ip>:/opt/monitoring && ssh root@<lxc-ip> 'cd /opt/monitoring && bash install.sh'`
The script adds the Grafana apt repo, installs grafana/loki/tempo/alloy, drops the Prometheus
release binary into `/opt/prometheus`, lays our configs over the packaged defaults, and
enables all five services. It prints the URLs and the OTLP endpoint when done.
## 3. Verify
```bash
systemctl is-active grafana-server loki tempo prometheus alloy # all → active
curl -s localhost:3100/ready # Loki → ready
curl -s localhost:3200/ready # Tempo → ready
curl -s localhost:9090/-/ready # Prometheus → Ready
```
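Right after `install.sh` finishes, a backend can take a little while before its readiness endpoint returns success, so a single `curl` may fail spuriously. A small retry helper can poll the same endpoints until they answer. This is a sketch only: `wait_ready` is a hypothetical function, and the retry count and delay are arbitrary defaults.

```bash
# wait_ready URL [TRIES] [DELAY_SECONDS]
# Poll URL until curl gets a successful (non-error) HTTP response.
# Returns 0 on success, 1 if every attempt failed.
wait_ready() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503
    if curl -fsS -o /dev/null --max-time 3 "$url"; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out: $url" >&2
  return 1
}
```

For example: `for u in localhost:3100/ready localhost:3200/ready localhost:9090/-/ready; do wait_ready "http://$u" || break; done`.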
Open Grafana at `http://<lxc-ip>:3000` (first login `admin` / `admin` — change it). The three
datasources and the **BlueLaminate → Stack Overview** dashboard are provisioned automatically.
Alloy's pipeline graph is at `http://<lxc-ip>:12345`.
### End-to-end OTLP smoke test (no app changes needed)
Send synthetic telemetry from any machine that can reach the LXC, using the OpenTelemetry
`telemetrygen` tool (`go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest`):
```bash
telemetrygen traces --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --traces 5
telemetrygen metrics --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --duration 10s
telemetrygen logs --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --logs 5
```
Then in Grafana **Explore**: pick **Tempo** (search recent traces), **Prometheus** (query
`gen`), and **Loki** (`{service_name=~".+"}`) — seeing data in all three confirms the full
fan-out before any app is wired up.
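The same check can be done without Grafana by querying each backend's HTTP API directly. The sketch below assumes it is run on the LXC itself (Loki and Tempo listen on localhost only) and uses the ports and the `gen` metric name mentioned above; `smoke_check` is a hypothetical helper name.

```bash
# smoke_check [HOST]: ask each backend for the data telemetrygen just sent.
# HOST defaults to localhost; output is the raw JSON from each API.
smoke_check() {
  host=${1:-localhost}

  echo "== Prometheus: instant query for the 'gen' metric"
  curl -fsS -G "http://$host:9090/api/v1/query" \
    --data-urlencode 'query=gen'
  echo

  echo "== Loki: recent log lines from any service (default: last hour)"
  curl -fsS -G "http://$host:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={service_name=~".+"}' \
    --data-urlencode 'limit=5'
  echo

  echo "== Tempo: search recent traces"
  curl -fsS "http://$host:3200/api/search?limit=5"
  echo
}
```

A non-empty `result` array from Prometheus and Loki, and a non-empty `traces` array from Tempo, point to the same conclusion as the Explore view.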
## 4. Wiring the apps later (the OTLP contract)
This deployment is **stack-only**; the C2 and workers aren't instrumented yet. When you
instrument them, point them at this LXC; nothing on the stack side has to change. The drop-in:
**.NET C2** (`BlueLaminate.C2`) — add packages `OpenTelemetry.Extensions.Hosting`,
`OpenTelemetry.Exporter.OpenTelemetryProtocol`, and the
`OpenTelemetry.Instrumentation.AspNetCore` / `.Http` / runtime instrumentations, then
`builder.Services.AddOpenTelemetry().WithTracing(...).WithMetrics(...)` plus
`builder.Logging.AddOpenTelemetry(...)`. Configure via env:
```
OTEL_EXPORTER_OTLP_ENDPOINT=http://<lxc-ip>:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=bluelaminate-c2
```
**Python workers** (`worker/csmoney_worker.py`, `skinland_worker.py`) — add
`opentelemetry-distro` and `opentelemetry-exporter-otlp` to `worker/requirements.txt`, run
under `opentelemetry-instrument python csmoney_worker.py`, same env vars with
`OTEL_SERVICE_NAME=csmoney-worker` / `skinland-worker`. (Today the workers emit structured
JSON logs to stdout — `LOG_JSON=1`, set by default in the image; an interim option is to
ship their Docker stdout to Loki with an Alloy `loki.source.docker` component on the app
host, which can parse those JSON fields directly, instead of instrumenting in-process.)
Add those env vars to the matching `docker-compose.yml` services when the instrumentation lands.
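To keep the three variables consistent across services, they can be generated once per service and loaded from a file. This is a sketch under assumptions: `otel_env` and the `otel.env` file name are hypothetical, and `192.168.1.50` is just the example static IP from section 1.

```bash
# otel_env LXC_IP SERVICE_NAME
# Print the OTLP env vars from the contract above for one service.
otel_env() {
  lxc_ip=$1
  service=$2
  cat <<EOF
OTEL_EXPORTER_OTLP_ENDPOINT=http://${lxc_ip}:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=${service}
EOF
}

# Example (not executed here):
#   otel_env 192.168.1.50 csmoney-worker > worker/otel.env
#   ( set -a; . worker/otel.env; set +a
#     opentelemetry-instrument python csmoney_worker.py )
```

The resulting file uses plain `KEY=value` lines, so it could also be referenced from a compose service via `env_file` instead of repeating the values inline.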
## Hardening
- **Firewall the OTLP ports.** `4317`/`4318` are bound to `0.0.0.0`. Restrict them to the app
host, e.g. `ufw allow from <app-host-ip> to any port 4317,4318 proto tcp`.
- **Auth on ingest (optional).** Add an `otelcol.auth.bearer` handler to
`otelcol.receiver.otlp` in `alloy/config.alloy` and send a matching
`OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer <token>` from the apps.
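- **Grafana password.** Change `admin` on first login, or set `admin_password` under
`[security]` in `/etc/grafana/grafana.ini` before Grafana's first start
(`GF_SECURITY_ADMIN_PASSWORD` is the equivalent environment variable, not an ini key).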
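The firewall bullet can be extended into a full rule set. The sketch below only prints the `ufw` commands so they can be reviewed before anything is applied. It assumes `ufw` is installed and usable inside the container; `print_ufw_rules`, the SSH rule, and the LAN-wide Grafana rule are illustrative choices, not something `install.sh` sets up.

```bash
# print_ufw_rules APP_HOST_IP LAN_CIDR
# Emit (do not run) a deny-by-default rule set for this LXC.
print_ufw_rules() {
  app_host=$1
  lan=$2
  cat <<EOF
ufw default deny incoming
ufw allow 22/tcp
ufw allow from ${app_host} to any port 4317,4318 proto tcp
ufw allow from ${lan} to any port 3000 proto tcp
ufw --force enable
EOF
}
```

For example, `print_ufw_rules 192.168.1.20 192.168.1.0/24` prints five lines; run them as root once they look right. Drop the `22/tcp` rule if the container is only ever entered with `pct enter`.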
## Retention / sizing
Defaults are LXC-friendly: Prometheus **15d**, Loki **15d**, Tempo **7d**. Bump the
`--storage.tsdb.retention.time` flag (`prometheus.service`), `limits_config.retention_period` (`loki.yml`),
and `compactor.compaction.block_retention` (`tempo.yml`) if you have the disk. Re-run
`install.sh` to apply config edits.
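After re-running `install.sh`, the retention that Prometheus is actually using can be read back from its flags API. This is a minimal sketch, assuming Prometheus listens on `:9090` as above; `prom_retention` is a hypothetical helper, and the `grep`/`cut` parsing avoids requiring `jq`.

```bash
# prom_retention [HOST]
# Print the effective --storage.tsdb.retention.time value (e.g. 15d)
# as reported by Prometheus' /api/v1/status/flags endpoint.
prom_retention() {
  host=${1:-localhost}
  curl -fsS "http://$host:9090/api/v1/status/flags" |
    grep -o '"storage\.tsdb\.retention\.time":"[^"]*"' |
    cut -d'"' -f4
}
```

If the printed value still shows the old retention, the unit was probably not restarted with the edited flag.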