License server is finished

2025-06-18 23:22:38 +02:00
Parent 6d1a52b7e3
Commit 7017549fcd
21 changed files with 1650 additions and 5 deletions

monitoring/README.md

@@ -0,0 +1,272 @@
# V2 Docker Monitoring Stack
## Overview
The monitoring solution for V2 Docker is built on the Prometheus stack and provides insight into the performance and health of all services.
## Components
### 1. **Prometheus** (Port 9090)
- Central metrics collection
- Scrape jobs configured for all services
- 30-day data retention
- Alert rules for critical events
### 2. **Grafana** (Port 3000)
- Visualization of the metrics
- Preconfigured dashboards
- Alerting integration
- Default login: admin/admin (change it at first login)
### 3. **Alertmanager** (Port 9093)
- Alert routing and grouping
- Email notifications
- Webhook integration
- Alert silencing and inhibition
### 4. **Exporters**
- **PostgreSQL Exporter**: database metrics
- **Redis Exporter**: cache metrics
- **Node Exporter**: system metrics
- **Nginx Exporter**: proxy metrics
## Installation
### 1. Start the monitoring stack
```bash
cd monitoring
docker-compose -f docker-compose.monitoring.yml up -d
```
### 2. Check the services
```bash
docker-compose -f docker-compose.monitoring.yml ps
```
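Beyond the container status, each core component exposes an HTTP health endpoint that can be polled. The following is a minimal sketch, assuming the ports listed under "Components" are published on localhost; the helper names `http_status` and `check_stack` are illustrative and not part of this repository.

```python
import urllib.error
import urllib.request

# Health endpoints of the three core services (ports as listed under "Components").
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, ...
        return None


def check_stack(fetch=http_status, urls=HEALTH_URLS):
    """Map each service name to True if its health endpoint answers with 200."""
    return {name: fetch(url) == 200 for name, url in urls.items()}
```

Calling `check_stack()` after `up -d` returns a dictionary such as `{"prometheus": True, ...}`. The `fetch` parameter exists only so the check can be exercised without a running stack.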
### 3. Grafana access
1. Open https://monitoring.v2-docker.com (or http://localhost:3000)
2. Log in with admin/admin (or the credentials set in `.env`)
3. Set a new password
4. Open the "License Server Overview" dashboard
## Configuration
### Environment variables
Create a `.env` file in the monitoring directory:
```env
# Grafana
GRAFANA_USER=admin
GRAFANA_PASSWORD=secure-password
# PostgreSQL Connection
POSTGRES_PASSWORD=your-postgres-password
# Alertmanager SMTP
SMTP_USERNAME=alerts@yourdomain.com
SMTP_PASSWORD=smtp-password
# Webhook URLs
WEBHOOK_CRITICAL=https://your-webhook-url/critical
WEBHOOK_SECURITY=https://your-webhook-url/security
```
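Because missing or placeholder values in `.env` only surface later as failed logins or undelivered alerts, it can help to check the file before starting the stack. This is a rough sketch, assuming the simple `KEY=VALUE` layout shown above; the function names and the list of placeholder values are assumptions made for illustration.

```python
from pathlib import Path

# Variables defined in the .env example above.
REQUIRED_KEYS = (
    "GRAFANA_USER", "GRAFANA_PASSWORD", "POSTGRES_PASSWORD",
    "SMTP_USERNAME", "SMTP_PASSWORD", "WEBHOOK_CRITICAL", "WEBHOOK_SECURITY",
)
# Example passwords that should not be used unchanged (assumed list).
PLACEHOLDERS = {"admin", "secure-password", "your-postgres-password", "smtp-password"}


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_problems(values):
    """Return a list of problems: missing keys and placeholder passwords."""
    problems = [f"{key} is missing or empty" for key in REQUIRED_KEYS if not values.get(key)]
    for key in ("GRAFANA_PASSWORD", "POSTGRES_PASSWORD", "SMTP_PASSWORD"):
        if values.get(key) in PLACEHOLDERS:
            problems.append(f"{key} still uses a placeholder value")
    return problems


def check_env_file(path=".env"):
    """Read the file at path and return the problems found in it."""
    return find_problems(parse_env(Path(path).read_text(encoding="utf-8")))
```

An empty list from `check_env_file()` means every variable is present and no password matches one of the example values.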
### Alert configuration
Alerts are defined in `prometheus/rules/license-server-alerts.yml`:
- **HighLicenseValidationErrorRate**: error rate > 5%
- **PossibleLicenseAbuse**: suspicious activity (validations for one license from multiple IPs)
- **LicenseServerDown**: service unreachable
- **HighLicenseValidationLatency**: 95th-percentile response time > 500 ms
- **DatabaseConnectionPoolExhausted**: DB connections > 90% of `max_connections`
### Adding new alerts
1. Edit `prometheus/rules/license-server-alerts.yml`
2. Add a new alert rule:
```yaml
- alert: YourAlertName
  expr: your_prometheus_query > threshold
  for: 5m
  labels:
    severity: warning
    service: your-service
  annotations:
    summary: "Alert summary"
    description: "Detailed description"
```
3. Reload Prometheus:
```bash
curl -X POST http://localhost:9090/-/reload
```
## Dashboards
### License Server Overview
Shows the key metrics:
- Active licenses
- Validations per second
- Error rate
- Response time percentiles
- Anomaly detection
- Top 10 most active licenses
### Creating new dashboards
1. Log in to Grafana
2. Create → Dashboard
3. Add a panel
4. Enter a Prometheus query
5. Save the dashboard
6. Export it as JSON for backup
## Metrics
### License server metrics
- `license_validation_total`: number of validations
- `license_validation_duration_seconds`: validation duration
- `active_licenses_total`: active licenses
- `anomaly_detections_total`: detected anomalies
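The dashboard and the latency alert derive percentiles from the `_bucket` series of `license_validation_duration_seconds` via `histogram_quantile`. The sketch below illustrates the idea behind that function: find the bucket that contains the requested rank and interpolate linearly inside it. It is a simplified illustration rather than Prometheus's exact implementation, and the bucket boundaries and counts in the usage note are made up.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) tuples sorted by
    upper_bound, ending with the +Inf bucket (float("inf")).
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf") or count == lower_count:
                # No finite upper bound (or an empty bucket): fall back to the lower bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With the hypothetical buckets `[(0.1, 60), (0.25, 90), (0.5, 98), (1.0, 100), (float("inf"), 100)]`, the 95th percentile falls into the 0.25 to 0.5 second bucket and is estimated at roughly 0.406 seconds, which would stay below the 500 ms alert threshold. The estimate can never be more precise than the bucket layout allows.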
### System metrics
- `node_cpu_seconds_total`: CPU time per mode (basis for CPU utilization)
- `node_memory_MemAvailable_bytes`: available memory
- `node_filesystem_avail_bytes`: available disk space
### Database metrics
- `pg_stat_database_numbackends`: active DB connections
- `pg_stat_database_tup_fetched`: tuples fetched
- `pg_stat_database_conflicts`: conflicts
## Troubleshooting
### Prometheus cannot reach a service
1. Check the network:
```bash
docker network inspect v2_internal_net
```
2. Check that the metrics endpoint is reachable from the Prometheus container:
```bash
docker exec prometheus wget -O- http://license-server:8443/metrics
```
### No data in Grafana
1. Check the data source:
   - Settings → Data Sources → Prometheus
   - Test Connection
2. Check the Prometheus targets:
   - http://localhost:9090/targets
   - All targets should be "UP"
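The target check can also be scripted against the Prometheus HTTP API, which serves the same information as the `/targets` page under `/api/v1/targets`. A minimal sketch, assuming Prometheus is reachable on `localhost:9090`; the function names are illustrative.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in targets
        if target.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the current target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)
```

`unhealthy_targets(fetch_targets())` returns an empty list when every scrape job is healthy; otherwise the `last_error` field usually names the cause, for example a refused connection.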
### Alerts are not sent
1. Check the Alertmanager logs:
```bash
docker logs alertmanager
```
2. Verify the SMTP configuration
3. Test the webhook URLs
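One way to test the notification path end to end is to post a synthetic alert directly to Alertmanager's v2 API and watch whether the email or webhook arrives. The sketch below assumes Alertmanager is reachable on `localhost:9093`; the alert name `ManualTestAlert` and the helper names are hypothetical and not defined anywhere in the rule files.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(severity="critical", minutes=5):
    """Build a short-lived synthetic alert; all label values are illustrative."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "ManualTestAlert",  # hypothetical name, not a real rule
            "severity": severity,
            "service": "license-server",
        },
        "annotations": {
            "summary": "Manual test alert",
            "description": "Sent by hand to verify Alertmanager routing",
        },
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts, base_url="http://localhost:9093"):
    """POST the alerts to the Alertmanager v2 API and return the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

With the route tree from this commit, `send_alerts(build_test_alert("critical"))` should match the `severity: critical` route and reach the `critical` receiver. If nothing arrives, the Alertmanager logs from step 1 usually show the SMTP or webhook error.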
## Maintenance
### Backup
1. Prometheus data:
```bash
mkdir -p backups
# Stream the archive to the host so it is not written into the directory being archived
docker exec prometheus tar czf - -C / prometheus > backups/prometheus-backup.tar.gz
```
2. Grafana dashboards:
   - Export as JSON via the UI
   - Save them in `grafana/dashboards/`
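The UI export can also be automated through Grafana's HTTP API, which returns a dashboard by its `uid`. This is a minimal sketch, assuming basic authentication with the Grafana admin credentials and Grafana on `localhost:3000`; the function names are illustrative.

```python
import base64
import json
import urllib.request
from pathlib import Path


def extract_dashboard(api_response):
    """Return the dashboard model from a /api/dashboards/uid/<uid> response."""
    dashboard = dict(api_response["dashboard"])
    # Drop the instance-specific numeric id so the file can be imported elsewhere.
    dashboard["id"] = None
    return dashboard


def export_dashboard(uid, user, password, base_url="http://localhost:3000",
                     target_dir="grafana/dashboards"):
    """Download one dashboard by uid and write it as JSON into target_dir."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        f"{base_url}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        dashboard = extract_dashboard(json.load(response))
    path = Path(target_dir) / f"{uid}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(dashboard, indent=2), encoding="utf-8")
    return path
```

For the dashboard shipped with this stack, the call would be `export_dashboard("license-server-overview", user, password)`, using the `uid` from the dashboard JSON.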
### Updates
1. Update the images:
```bash
docker-compose -f docker-compose.monitoring.yml pull
docker-compose -f docker-compose.monitoring.yml up -d
```
2. Reload the configuration:
```bash
# Prometheus
curl -X POST http://localhost:9090/-/reload
# Alertmanager
curl -X POST http://localhost:9093/-/reload
```
## Performance optimization
### Adjusting retention
In `docker-compose.monitoring.yml`:
```yaml
command:
  - '--storage.tsdb.retention.time=15d'  # Reduce to use less disk space
```
### Scrape intervals
In `prometheus/prometheus.yml`:
```yaml
global:
  scrape_interval: 30s  # Increase to reduce load
```
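Retention and scrape interval together determine how much disk space the TSDB needs. The Prometheus documentation gives the rule of thumb `retention_seconds * ingested_samples_per_second * bytes_per_sample`, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; the series count of 10,000 in the usage note is a made-up figure, not a measurement of this stack.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB size estimate using the rule of thumb from the Prometheus docs."""
    # Every active series yields one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


def to_gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes for display."""
    return num_bytes / 1e9
```

For a hypothetical 10,000 active series, `estimate_disk_bytes(10_000, 15, 30)` gives about 3.46 GB for the defaults in this stack (15s interval, 30 days), while `estimate_disk_bytes(10_000, 30, 15)` gives about 0.86 GB. Doubling the interval and halving the retention each cut the estimate in half.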
### Resource Limits
Adjust the limits in `docker-compose.monitoring.yml` to your environment.
## Security
1. **Grafana**: change the default password immediately
2. **Prometheus**: no public access (internal only)
3. **Alertmanager**: keep webhook URLs secret
4. **Exporters**: reachable only on the internal network
## Integration
### In a CI/CD pipeline
```bash
# Send deployment metrics (the Pushgateway expects the body to end with a newline, which echo adds)
echo 'deployment_status{version="1.2.3",environment="production"} 1' | \
  curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/deployment
```
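The same push can be done from a Python deployment script using only the standard library. This is a sketch under the assumption that a Pushgateway is reachable at `prometheus-pushgateway:9091` as in the command above; `format_sample` and `push_metric` are illustrative names, and label values are not escaped here.

```python
import urllib.request


def format_sample(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    # The trailing newline is required; the Pushgateway rejects bodies without it.
    return f"{name}{{{label_text}}} {value}\n"


def push_metric(body, job, base_url="http://prometheus-pushgateway:9091"):
    """POST a text-format body to the Pushgateway under the given job name."""
    request = urllib.request.Request(
        f"{base_url}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A pipeline step could then call `push_metric(format_sample("deployment_status", {"version": "1.2.3", "environment": "production"}, 1), "deployment")`.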
### Custom metrics
In your application:
```python
from prometheus_client import Counter
# Define the counter once at module level, then increment it wherever the event occurs
custom_metric = Counter('my_custom_total', 'Description')
custom_metric.inc()
```
## Support
If you run into problems:
1. Check the logs: `docker-compose -f docker-compose.monitoring.yml logs [service]`
2. Documentation: https://prometheus.io/docs/
3. Grafana Docs: https://grafana.com/docs/


@@ -0,0 +1,94 @@
global:
  resolve_timeout: 5m
  smtp_from: 'alerts@v2-docker.com'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: '${SMTP_USERNAME}'
  smtp_auth_password: '${SMTP_PASSWORD}'
  smtp_require_tls: true
# Templates for notifications
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# Route tree
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    # License abuse alerts
    - match:
        alertname: PossibleLicenseAbuse
      receiver: 'security'
      repeat_interval: 1h
    # Database alerts
    - match:
        service: postgres
      receiver: 'database'
    # Infrastructure alerts
    - match_re:
        alertname: ^(HighCPUUsage|HighMemoryUsage|LowDiskSpace)$
      receiver: 'infrastructure'
# Receivers
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@v2-docker.com'
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        html: |
          <h2>Alert: {{ .GroupLabels.alertname }}</h2>
          <p><b>Status:</b> {{ .Status }}</p>
          {{ range .Alerts }}
          <hr>
          <p><b>Summary:</b> {{ .Annotations.summary }}</p>
          <p><b>Description:</b> {{ .Annotations.description }}</p>
          <p><b>Labels:</b></p>
          <ul>
          {{ range .Labels.SortedPairs }}
          <li><b>{{ .Name }}:</b> {{ .Value }}</li>
          {{ end }}
          </ul>
          {{ end }}
  - name: 'critical'
    email_configs:
      - to: 'critical-alerts@v2-docker.com'
        send_resolved: true
    webhook_configs:
      - url: '${WEBHOOK_CRITICAL}'
        send_resolved: true
  - name: 'security'
    email_configs:
      - to: 'security@v2-docker.com'
    webhook_configs:
      - url: '${WEBHOOK_SECURITY}'
  - name: 'database'
    email_configs:
      - to: 'dba@v2-docker.com'
  - name: 'infrastructure'
    email_configs:
      - to: 'ops@v2-docker.com'
# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']


@@ -0,0 +1,149 @@
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - v2_internal_net
    ports:
      - "9090:9090"
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 2g
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://monitoring.v2-docker.com
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource,grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    networks:
      - v2_internal_net
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512m
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - v2_internal_net
    ports:
      - "9093:9093"
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 256m
  # PostgreSQL Exporter
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://postgres:${POSTGRES_PASSWORD}@postgres:5432/v2_adminpanel?sslmode=disable"
    networks:
      - v2_internal_net
    deploy:
      resources:
        limits:
          cpus: '0.25'
          memory: 128m
  # Redis Exporter
  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    restart: unless-stopped
    environment:
      REDIS_ADDR: "redis://redis:6379"
    networks:
      - v2_internal_net
    deploy:
      resources:
        limits:
          cpus: '0.25'
          memory: 128m
  # Node Exporter (for host metrics)
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - v2_internal_net
    deploy:
      resources:
        limits:
          cpus: '0.25'
          memory: 128m
  # Nginx Exporter
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:latest
    container_name: nginx-exporter
    restart: unless-stopped
    command:
      - '-nginx.scrape-uri=http://nginx-proxy:8080/nginx_status'
    networks:
      - v2_internal_net
    deploy:
      resources:
        limits:
          cpus: '0.25'
          memory: 128m
networks:
  v2_internal_net:
    external: true
volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:


@@ -0,0 +1,562 @@
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
}
},
"gridPos": {
"h": 4,
"w": 6,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "sum(active_licenses_total)",
"refId": "A"
}
],
"title": "Active Licenses",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "ops"
}
},
"gridPos": {
"h": 4,
"w": 6,
"x": 6,
"y": 0
},
"id": 2,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "sum(rate(license_validation_total[5m]))",
"refId": "A"
}
],
"title": "Validations/sec",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.01
},
{
"color": "red",
"value": 0.05
}
]
},
"unit": "percentunit"
}
},
"gridPos": {
"h": 4,
"w": 6,
"x": 12,
"y": 0
},
"id": 3,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "sum(rate(license_validation_errors_total[5m])) / sum(rate(license_validation_total[5m]))",
"refId": "A"
}
],
"title": "Error Rate",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 200
},
{
"color": "red",
"value": 500
}
]
},
"unit": "ms"
}
},
"gridPos": {
"h": 4,
"w": 6,
"x": 18,
"y": 0
},
"id": 4,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"text": {},
"textMode": "auto"
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(license_validation_duration_seconds_bucket[5m])) by (le)) * 1000",
"refId": "A"
}
],
"title": "95th Percentile Latency",
"type": "stat"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "reqps"
}
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 4
},
"id": 5,
"options": {
"tooltip": {
"mode": "single"
},
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
}
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "sum(rate(license_validation_total{result=\"success\"}[5m]))",
"legendFormat": "Success",
"refId": "A"
},
{
"expr": "sum(rate(license_validation_total{result=\"invalid\"}[5m]))",
"legendFormat": "Invalid",
"refId": "B"
},
{
"expr": "sum(rate(license_validation_total{result=\"expired\"}[5m]))",
"legendFormat": "Expired",
"refId": "C"
}
],
"title": "License Validation Rate",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "ms"
}
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 4
},
"id": 6,
"options": {
"tooltip": {
"mode": "single"
},
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
}
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(license_validation_duration_seconds_bucket[5m])) by (le)) * 1000",
"legendFormat": "50th percentile",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(license_validation_duration_seconds_bucket[5m])) by (le)) * 1000",
"legendFormat": "95th percentile",
"refId": "B"
},
{
"expr": "histogram_quantile(0.99, sum(rate(license_validation_duration_seconds_bucket[5m])) by (le)) * 1000",
"legendFormat": "99th percentile",
"refId": "C"
}
],
"title": "Response Time Percentiles",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
}
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 12
},
"id": 7,
"options": {
"tooltip": {
"mode": "single"
},
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
}
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "sum(rate(anomaly_detections_total{severity=\"low\"}[5m]))",
"legendFormat": "Low",
"refId": "A"
},
{
"expr": "sum(rate(anomaly_detections_total{severity=\"medium\"}[5m]))",
"legendFormat": "Medium",
"refId": "B"
},
{
"expr": "sum(rate(anomaly_detections_total{severity=\"high\"}[5m]))",
"legendFormat": "High",
"refId": "C"
},
{
"expr": "sum(rate(anomaly_detections_total{severity=\"critical\"}[5m]))",
"legendFormat": "Critical",
"refId": "D"
}
],
"title": "Anomaly Detection Rate by Severity",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {
"align": "auto",
"displayMode": "auto"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
}
}
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 12
},
"id": 8,
"options": {
"showHeader": true
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "topk(10, sum by (license_id) (rate(license_validation_total[1h])))",
"format": "table",
"instant": true,
"refId": "A"
}
],
"title": "Top 10 Most Active Licenses (Last Hour)",
"type": "table"
}
],
"refresh": "10s",
"schemaVersion": 27,
"style": "dark",
"tags": ["license-server", "monitoring"],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "License Server Overview",
"uid": "license-server-overview",
"version": 0
}


@@ -0,0 +1,12 @@
apiVersion: 1
providers:
  - name: 'V2 Docker Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards


@@ -0,0 +1,13 @@
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST


@@ -0,0 +1,111 @@
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'v2-docker-monitor'
    environment: 'production'
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
# Load rules once and periodically evaluate them
rule_files:
  - '/etc/prometheus/rules/*.yml'
# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'prometheus'
  # License Server metrics
  - job_name: 'license-server'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['license-server:8443']
        labels:
          service: 'license-server'
          component: 'api'
  # Auth Service metrics
  - job_name: 'auth-service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['auth-service:5001']
        labels:
          service: 'auth-service'
          component: 'authentication'
  # Analytics Service metrics
  - job_name: 'analytics-service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['analytics-service:5003']
        labels:
          service: 'analytics-service'
          component: 'analytics'
  # Admin API Service metrics
  - job_name: 'admin-api-service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['admin-api-service:5004']
        labels:
          service: 'admin-api-service'
          component: 'admin'
  # Admin Panel metrics
  - job_name: 'admin-panel'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['admin-panel:5000']
        labels:
          service: 'admin-panel'
          component: 'ui'
  # PostgreSQL Exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          service: 'postgres'
          component: 'database'
  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
        labels:
          service: 'redis'
          component: 'cache'
  # RabbitMQ metrics
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
        labels:
          service: 'rabbitmq'
          component: 'messaging'
  # Node Exporter for host metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          service: 'node-exporter'
          component: 'infrastructure'
  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
        labels:
          service: 'nginx'
          component: 'proxy'


@@ -0,0 +1,174 @@
groups:
  - name: license_server_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighLicenseValidationErrorRate
        expr: |
          (
            sum(rate(license_validation_errors_total[5m]))
            /
            sum(rate(license_validation_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: warning
          service: license-server
        annotations:
          summary: "High license validation error rate ({{ $value | humanizePercentage }})"
          description: "License validation error rate is above 5% for the last 5 minutes"
      # License abuse detection
      - alert: PossibleLicenseAbuse
        expr: |
          rate(license_validation_total{result="multiple_ips"}[5m]) > 0.1
        for: 10m
        labels:
          severity: critical
          service: license-server
        annotations:
          summary: "Possible license abuse detected"
          description: "High rate of validations from multiple IPs for same license"
      # Service down
      - alert: LicenseServerDown
        expr: up{job="license-server"} == 0
        for: 2m
        labels:
          severity: critical
          service: license-server
        annotations:
          summary: "License server is down"
          description: "License server has been down for more than 2 minutes"
      # High response time
      - alert: HighLicenseValidationLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(license_validation_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          service: license-server
        annotations:
          summary: "High license validation latency"
          description: "95th percentile latency is above 500ms"
      # Anomaly detection
      - alert: HighAnomalyDetectionRate
        expr: |
          sum(rate(anomaly_detections_total{severity=~"high|critical"}[5m])) > 0.5
        for: 5m
        labels:
          severity: critical
          service: license-server
        annotations:
          summary: "High rate of critical anomalies detected"
          description: "More than 0.5 critical anomalies per second detected"
  - name: database_alerts
    interval: 30s
    rules:
      # Database connection pool exhaustion
      # Match on instance only: pg_settings_max_connections carries no datname label,
      # so a plain division would find no matching series and the alert would never fire.
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          (
            pg_stat_database_numbackends{datname="v2_adminpanel"}
            / on (instance)
            pg_settings_max_connections
          ) > 0.9
        for: 5m
        labels:
          severity: critical
          service: postgres
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "PostgreSQL connection usage is above 90%"
      # Database replication lag
      - alert: DatabaseReplicationLag
        expr: |
          pg_replication_lag_seconds > 10
        for: 5m
        labels:
          severity: warning
          service: postgres
        annotations:
          summary: "Database replication lag detected"
          description: "Replication lag is {{ $value }} seconds"
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          (
            100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
          ) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 10 minutes"
      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90%"
      # Disk space
      - alert: LowDiskSpace
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space remaining"
  - name: cache_alerts
    interval: 30s
    rules:
      # Redis connection errors
      - alert: RedisConnectionErrors
        expr: |
          rate(redis_connection_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
          service: redis
        annotations:
          summary: "Redis connection errors detected"
          description: "Redis connection error rate is {{ $value }} per second"
      # Cache hit rate
      - alert: LowCacheHitRate
        expr: |
          (
            redis_keyspace_hits_total
            /
            (redis_keyspace_hits_total + redis_keyspace_misses_total)
          ) < 0.7
        for: 10m
        labels:
          severity: warning
          service: redis
        annotations:
          summary: "Low Redis cache hit rate"
          description: "Cache hit rate is below 70%"