almost ready
148
monitoring/README.md
Normal file
@@ -0,0 +1,148 @@
# BlueLaminate observability stack (standalone, Proxmox LXC)

A self-contained Grafana **LGTM** stack (**L**oki for logs, **G**rafana for dashboards,
**T**empo for traces, and Prometheus for **M**etrics) fronted by **Grafana Alloy** as a single
OTLP ingress. It runs as native systemd services on its own Proxmox LXC, decoupled from the
app's `docker-compose.yml`. The C2 and Python workers push OpenTelemetry data to Alloy, which
fans the three signals out to the backends; Grafana ties them together.

```
C2 / workers ──OTLP(4317 grpc / 4318 http)──► Alloy ──┬─► Loki       (logs,    :3100)
(other host)                                          ├─► Prometheus (metrics, :9090, remote-write)
                                                      └─► Tempo      (traces,  :4319 OTLP → store)
                                                                │
                                                         Grafana (:3000)
                                             datasources: Loki + Prometheus + Tempo
```

Only Alloy's OTLP ports (`4317`/`4318`) and Grafana (`3000`) need to be reachable from the
LAN. Loki and Tempo bind to localhost; Alloy is the only client that writes to them, and
Grafana queries them locally.

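To check which listeners are actually exposed, one option is to filter `ss` output for sockets bound to non-loopback addresses. This is a minimal sketch, assuming `ss` (iproute2), `awk`, and `sort` are available in the container; `exposed_ports` is a hypothetical helper name and is not part of `install.sh`.

```bash
# exposed_ports: read `ss -H -tln` output on stdin and print the TCP ports
# that listen on something other than loopback (127.0.0.0/8 or [::1]).
exposed_ports() {
  awk '{
    addr = $4                  # 4th column is Local Address:Port
    port = addr
    sub(/.*:/, "", port)       # keep the text after the last colon
    sub(/:[^:]*$/, "", addr)   # drop the port, keep the address
    if (addr !~ /^127\./ && addr != "[::1]") print port
  }' | sort -nu
}

# Usage inside the LXC:
#   ss -H -tln | exposed_ports
```

Any port in the output that you do not intend to serve to the LAN is a candidate for a firewall rule.
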
## Layout

```
monitoring/
  install.sh                        # idempotent provisioner; run as root in the LXC
  alloy/config.alloy                # OTLP receiver → batch → Loki / Prometheus / Tempo
  prometheus/prometheus.yml         # self-monitoring scrapes (app metrics arrive via remote-write)
  prometheus/prometheus.service     # systemd unit: remote-write + OTLP receivers, 15d retention
  loki/loki.yml                     # single-binary, filesystem store, 15d retention
  tempo/tempo.yml                   # OTLP on :4319, local store, metrics_generator → Prometheus
  grafana/datasources.yml           # Loki + Prometheus(default) + Tempo, correlated
  grafana/dashboards.yml            # file-based dashboard provider
  grafana/dashboards/overview.json  # starter dashboard (target health, span rates, logs)
```

## 1. Create the LXC (run on the Proxmox host)

Reference only: adjust the storage, bridge, and template names to your node. An unprivileged
Debian 13 container with ~2 vCPU / 2–4 GB RAM / 20–40 GB disk is plenty.

```bash
# Make sure a Debian 13 template is present (once):
#   pveam update && pveam available | grep debian-13
#   pveam download local debian-13-standard_*_amd64.tar.zst

pct create 910 local:vztmpl/debian-13-standard_13.0-1_amd64.tar.zst \
  --hostname grafana-lxc \
  --cores 2 --memory 4096 --swap 1024 \
  --rootfs local-lvm:32 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 1 --features nesting=0 \
  --onboot 1 --start 1

# (Optional) give it a static IP instead of dhcp, e.g.
#   --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1
```

`nesting=0` is fine: there's no Docker here, just native binaries.

## 2. Deploy the stack (inside the LXC)

```bash
pct enter 910                      # from the Proxmox host; or: ssh root@<lxc-ip>
apt-get update && apt-get install -y git
git clone <this-repo-url> /opt/bluelaminate
cd /opt/bluelaminate/monitoring
bash install.sh                    # already root here; a minimal template may not ship sudo
```

No git on the LXC? Copy just this folder over instead:
`scp -r monitoring root@<lxc-ip>:/opt/monitoring && ssh root@<lxc-ip> 'cd /opt/monitoring && bash install.sh'`

The script adds the Grafana apt repo, installs grafana/loki/tempo/alloy, drops the Prometheus
release binary into `/opt/prometheus`, lays our configs over the packaged defaults, and
enables all five services. It prints the URLs and the OTLP endpoint when done.

## 3. Verify

```bash
systemctl is-active grafana-server loki tempo prometheus alloy   # all → active
curl -s localhost:3100/ready      # Loki → ready
curl -s localhost:3200/ready      # Tempo → ready
curl -s localhost:9090/-/ready    # Prometheus → Prometheus Server is Ready.
```

Open Grafana at `http://<lxc-ip>:3000` (first login `admin` / `admin`; change it). The three
datasources and the **BlueLaminate → Stack Overview** dashboard are provisioned automatically.
Alloy's pipeline graph is at `http://<lxc-ip>:12345`.

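Right after `install.sh` finishes, a backend may still be starting, so a first `curl` can fail even though the service comes up moments later. A small retry wrapper makes the readiness checks repeatable. This is a sketch only; `retry` is a hypothetical helper, not something the repo ships.

```bash
# retry TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt fails.
retry() {
  local tries=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0
    if ((i < tries)); then sleep "$delay"; fi
  done
  return 1
}

# Usage inside the LXC (same endpoints as the checks above; -f makes curl
# fail on a non-2xx status such as "not ready yet"):
#   retry 30 2 curl -fsS -o /dev/null localhost:3100/ready      # Loki
#   retry 30 2 curl -fsS -o /dev/null localhost:3200/ready      # Tempo
#   retry 30 2 curl -fsS -o /dev/null localhost:9090/-/ready    # Prometheus
```
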
### End-to-end OTLP smoke test (no app changes needed)

Send synthetic telemetry from any machine that can reach the LXC, using the OpenTelemetry
`telemetrygen` tool (`go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest`):

```bash
telemetrygen traces  --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --traces 5
telemetrygen metrics --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --duration 10s
telemetrygen logs    --otlp-endpoint <lxc-ip>:4317 --otlp-insecure --logs 5
```

Then in Grafana **Explore**: pick **Tempo** (search recent traces), **Prometheus** (query
`gen`), and **Loki** (`{service_name=~".+"}`). Seeing data in all three confirms the full
fan-out before any app is wired up.

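The same three checks can also be run from a shell through each backend's HTTP query API, which helps when Grafana itself is the thing in doubt. A minimal sketch, assuming `curl` is installed and the ports shown in the diagram and the Verify step; the `check_*` function names are hypothetical.

```bash
# Run these inside the LXC: Loki and Tempo listen on localhost only.

# Prometheus: instant query for the synthetic metric.
check_metrics() {
  curl -fsS -G localhost:9090/api/v1/query --data-urlencode 'query=gen'
}

# Loki: range query with the same selector as in Explore
# (with no start/end given, Loki searches the last hour).
check_logs() {
  curl -fsS -G localhost:3100/loki/api/v1/query_range \
    --data-urlencode 'query={service_name=~".+"}' \
    --data-urlencode 'limit=5'
}

# Tempo: search for a handful of recent traces.
check_traces() {
  curl -fsS -G localhost:3200/api/search --data-urlencode 'limit=5'
}
```

Each function prints the raw JSON response; an empty result list after running `telemetrygen` points at the Alloy pipeline or the backend for that signal.
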
## 4. Wiring the apps later (the OTLP contract)

This deployment is **stack-only**; the C2 and workers aren't instrumented yet. When you
instrument them, point them at this LXC; nothing here changes. The drop-in:

**.NET C2** (`BlueLaminate.C2`): add packages `OpenTelemetry.Extensions.Hosting`,
`OpenTelemetry.Exporter.OpenTelemetryProtocol`, and the
`OpenTelemetry.Instrumentation.AspNetCore` / `.Http` / runtime instrumentations, then
`builder.Services.AddOpenTelemetry().WithTracing(...).WithMetrics(...)` plus
`builder.Logging.AddOpenTelemetry(...)`. Configure via env:

```
OTEL_EXPORTER_OTLP_ENDPOINT=http://<lxc-ip>:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=bluelaminate-c2
```

**Python workers** (`worker/csmoney_worker.py`, `skinland_worker.py`): add
`opentelemetry-distro` and `opentelemetry-exporter-otlp` to `worker/requirements.txt`, run
under `opentelemetry-instrument python csmoney_worker.py`, same env vars with
`OTEL_SERVICE_NAME=csmoney-worker` / `skinland-worker`. (Today the workers emit structured
JSON logs to stdout with `LOG_JSON=1`, set by default in the image. An interim option is to
ship their Docker stdout to Loki with an Alloy `loki.source.docker` component on the app
host, which can parse those JSON fields directly, instead of instrumenting in-process.)

Add those env vars to the matching `docker-compose.yml` services when the instrumentation lands.

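When that time comes, the per-service variables can be generated rather than hand-copied, and the app host's route to Alloy can be checked before any SDK is involved. The sketch below assumes `curl` on the app host; `otel_env` and `otlp_probe` are hypothetical helpers, and the IP address and output path in the usage comments are placeholders.

```bash
# otel_env SERVICE HOST: print the three OTLP variables for one service,
# in the KEY=value form that a docker-compose `env_file` accepts.
otel_env() {
  local service=$1 host=$2
  printf 'OTEL_EXPORTER_OTLP_ENDPOINT=http://%s:4318\n' "$host"
  printf 'OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf\n'
  printf 'OTEL_SERVICE_NAME=%s\n' "$service"
}

# otlp_probe HOST: POST an empty OTLP/HTTP JSON trace export to Alloy and
# print only the HTTP status code that comes back.
otlp_probe() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -X POST "http://$1:4318/v1/traces" \
    -H 'Content-Type: application/json' \
    -d '{"resourceSpans":[]}'
}

# Usage on the app host (placeholder IP and path):
#   otel_env csmoney-worker 192.168.1.50 > worker/otel.env
#   otlp_probe 192.168.1.50
```

A 2xx status from `otlp_probe` suggests the port is reachable and the receiver is accepting requests; a timeout or connection refusal points at the firewall or at Alloy.
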
## Hardening

- **Firewall the OTLP ports.** `4317`/`4318` are bound to `0.0.0.0`. Restrict them to the app
  host, e.g. `ufw allow from <app-host-ip> to any port 4317,4318 proto tcp`.
- **Auth on ingest (optional).** Add an `otelcol.auth.bearer` handler to
  `otelcol.receiver.otlp` in `alloy/config.alloy` and send a matching
  `OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer <token>` from the apps.
- **Grafana password.** Change `admin` on first login, or set `admin_password` under
  `[security]` in `/etc/grafana/grafana.ini` (equivalently, the `GF_SECURITY_ADMIN_PASSWORD`
  environment variable) before Grafana's first start.

## Retention / sizing

Defaults are LXC-friendly: Prometheus **15d**, Loki **15d**, Tempo **7d**. If you have the
disk, raise the `--storage.tsdb.retention.time` flag (`prometheus.service`),
`limits_config.retention_period` (`loki.yml`), and `compactor.compaction.block_retention`
(`tempo.yml`). Re-run `install.sh` to apply config edits.

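To confirm which retention Prometheus is actually running with after an edit, its HTTP API exposes the effective command-line flags at `/api/v1/status/flags`. The sketch below pulls one flag out of that JSON with `grep` and `cut`, assuming the response is compact JSON with no spaces around the colon; `prom_flag` is a hypothetical helper, and `jq` is a sturdier choice if it is installed.

```bash
# prom_flag NAME: read the /api/v1/status/flags JSON on stdin and print
# the value of the flag called NAME (without the leading dashes).
prom_flag() {
  grep -o "\"$1\":\"[^\"]*\"" | head -n 1 | cut -d '"' -f 4
}

# Usage inside the LXC:
#   curl -fsS localhost:9090/api/v1/status/flags | prom_flag storage.tsdb.retention.time
```
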