Installing the spark-operator Helm chart
# Add the spark-operator Helm repository
❯ helm repo add spark-operator https://kubeflow.github.io/spark-operator
# Install the spark-operator Helm chart
❯ helm install spark-operator spark-operator/spark-operator \
    --namespace spark-operator \
    --create-namespace \
    --set webhook.enable=true \
    --set "spark.jobNamespaces={spark-jobs}"
# Verify the installation
❯ kubectl get all -n spark-operator
NAME READY STATUS RESTARTS AGE
pod/spark-operator-controller-598d475647-mknvd 1/1 Running 0 6m56s
pod/spark-operator-webhook-565b79b589-d6cfk 1/1 Running 0 6m56s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/spark-operator-webhook-svc ClusterIP 10.102.148.40 <none> 9443/TCP 6m56s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/spark-operator-controller 1/1 1 1 6m56s
deployment.apps/spark-operator-webhook 1/1 1 1 6m56s
NAME DESIRED CURRENT READY AGE
replicaset.apps/spark-operator-controller-598d475647 1 1 1 6m56s
replicaset.apps/spark-operator-webhook-565b79b589 1 1 1 6m56s
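The application metrics queried later only show up once the operator has actually handled a job. A minimal SparkApplication sketch for the spark-jobs namespace follows; the image tag, example jar path, and serviceAccount name are assumptions and should be matched to your Spark version and chart release:

```yaml
# spark-pi.yaml -- a hypothetical smoke-test job; image, jar path,
# and serviceAccount below are assumptions, adjust to your setup
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.3
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar
  sparkVersion: "3.5.3"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark
  executor:
    instances: 1
    cores: 1
    memory: 512m
```

Apply it with `kubectl apply -f spark-pi.yaml`; each run increments counters such as spark_application_submit_count.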
Checking the spark-operator metrics endpoint
Prometheus monitoring is usually wired up with a ServiceMonitor, but spark-operator appears to use a PodMonitor instead: a GitHub issue notes that enabling metrics does not deploy a metrics Service, and indeed the only Service actually deployed was the webhook one.
The Prometheus metrics defaults defined in the spark-operator values.yaml are shown below.
The spark-operator chart can also create the PodMonitor itself, but here it will be created from the Prometheus side instead (keep prometheus.podMonitor.create=false).
prometheus:
  metrics:
    # -- Specifies whether to enable prometheus metrics scraping.
    enable: true
    # -- Metrics port.
    port: 8080
    # -- Metrics port name.
    portName: metrics
    # -- Metrics serving endpoint.
    endpoint: /metrics
    # -- Metrics prefix, will be added to all exported metrics.
    prefix: ""
    # -- Job Start Latency histogram buckets. Specified in seconds.
    jobStartLatencyBuckets: "30,60,90,120,150,180,210,240,270,300"
  # Prometheus pod monitor for controller pods
  podMonitor:
    # -- Specifies whether to create pod monitor.
    # Note that prometheus metrics should be enabled as well.
    create: false
    # -- Pod monitor labels
    labels: {}
    # -- The label to use to retrieve the job name from
    jobLabel: spark-operator-podmonitor
    # -- Prometheus metrics endpoint properties. `metrics.portName` will be used as a port
    podMetricsEndpoint:
      scheme: http
      interval: 5s
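With these defaults the controller serves metrics on port 8080 at /metrics. A quick sanity check before touching Prometheus (a sketch, assuming the controller deployment name from the output above) is to port-forward the controller and curl the endpoint:

```shell
# In one terminal: forward the controller's metrics port
# (deployment name taken from the kubectl output above)
kubectl -n spark-operator port-forward deploy/spark-operator-controller 8080:8080

# In another terminal: fetch the metrics in Prometheus text format
curl -s localhost:8080/metrics | grep '^spark_application' | head
```

If nothing comes back, recheck that prometheus.metrics.enable is true in the chart values.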
Configuring Prometheus
When installing the Prometheus Helm chart, add the spark-operator PodMonitor settings to its config.
# custom-values.yaml
prometheus:
  additionalPodMonitors:
    - name: spark-operator
      namespaceSelector:
        matchNames:
          - spark-operator
      selector:
        matchLabels:
          app.kubernetes.io/name: spark-operator
      podMetricsEndpoints:
        - port: metrics
          path: /metrics
          interval: 5s
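For reference, kube-prometheus-stack renders each additionalPodMonitors entry into a PodMonitor object roughly like the following (a sketch of the rendered result, not verbatim chart output):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spark-operator
  namespace: monitoring
spec:
  # Scrape pods in the spark-operator namespace...
  namespaceSelector:
    matchNames:
      - spark-operator
  # ...that carry the operator's app label
  selector:
    matchLabels:
      app.kubernetes.io/name: spark-operator
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 5s
```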
Pass the custom-values.yaml written above when installing the Prometheus Helm chart.
# Add the repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the chart
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f custom-values.yaml
# Verify the chart installation
❯ kubectl get all -n monitoring
NAME READY STATUS RESTARTS AGE
pod/alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 4m14s
pod/prometheus-grafana-995866d9b-pv9r7 3/3 Running 0 4m15s
pod/prometheus-kube-prometheus-operator-565cbd8649-plm2c 1/1 Running 0 4m15s
pod/prometheus-kube-state-metrics-849c746cf5-nkjjb 1/1 Running 0 4m15s
pod/prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 4m14s
pod/prometheus-prometheus-node-exporter-kqdws 1/1 Running 0 4m15s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 4m14s
service/prometheus-grafana ClusterIP 10.106.190.0 <none> 80/TCP 4m15s
service/prometheus-kube-prometheus-alertmanager ClusterIP 10.106.227.167 <none> 9093/TCP,8080/TCP 4m15s
service/prometheus-kube-prometheus-operator ClusterIP 10.98.152.26 <none> 443/TCP 4m15s
service/prometheus-kube-prometheus-prometheus ClusterIP 10.105.251.35 <none> 9090/TCP,8080/TCP 4m15s
service/prometheus-kube-state-metrics ClusterIP 10.97.95.107 <none> 8080/TCP 4m15s
service/prometheus-operated ClusterIP None <none> 9090/TCP 4m14s
service/prometheus-prometheus-node-exporter ClusterIP 10.101.172.93 <none> 9100/TCP 4m15s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/prometheus-prometheus-node-exporter 1 1 1 1 1 kubernetes.io/os=linux 4m15s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/prometheus-grafana 1/1 1 1 4m15s
deployment.apps/prometheus-kube-prometheus-operator 1/1 1 1 4m15s
deployment.apps/prometheus-kube-state-metrics 1/1 1 1 4m15s
NAME DESIRED CURRENT READY AGE
replicaset.apps/prometheus-grafana-995866d9b 1 1 1 4m15s
replicaset.apps/prometheus-kube-prometheus-operator-565cbd8649 1 1 1 4m15s
replicaset.apps/prometheus-kube-state-metrics-849c746cf5 1 1 1 4m15s
NAME READY AGE
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager 1/1 4m14s
statefulset.apps/prometheus-prometheus-kube-prometheus-prometheus 1/1 4m14s
# Check the generated PodMonitor
❯ kubectl get podmonitor -A
NAMESPACE NAME AGE
monitoring spark-operator 7m31s
Checking spark-operator metrics in the Prometheus web UI
Setting up an ingress is more trouble than it is worth here, so just port-forward and use localhost.
Run the command below, then open localhost:9090 in a web browser and check that the metrics are being collected properly.
# Port-forward the Prometheus web UI
❯ kubectl port-forward pods/prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 -n monitoring
On localhost:9090/targets the spark-operator target should show as up, and on localhost:9090/query the metrics can be queried.
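A couple of simple queries to try on the query page; the metric names come from the operator documentation linked in the next section:

```promql
# Applications currently running, as seen by the operator
spark_application_running_count

# Application submission rate over the last 5 minutes
rate(spark_application_submit_count[5m])
```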
Metrics exposed by the Spark Operator
https://www.kubeflow.org/docs/components/spark-operator/getting-started/#spark-application-metrics
Spark Application Metrics
- spark_application_count: Total number of SparkApplication handled by the Operator.
- spark_application_submit_count: Total number of SparkApplication spark-submitted by the Operator.
- spark_application_success_count: Total number of SparkApplication which completed successfully.
- spark_application_failure_count: Total number of SparkApplication which failed to complete.
- spark_application_running_count: Total number of SparkApplication which are currently running.
- spark_application_success_execution_time_seconds: Execution time for applications which succeeded.
- spark_application_failure_execution_time_seconds: Execution time for applications which failed.
- spark_application_start_latency_seconds: Start latency of SparkApplication as type of Prometheus Summary.
- spark_application_start_latency_seconds: Start latency of SparkApplication as type of Prometheus Histogram.
- spark_executor_success_count: Total number of Spark Executors which completed successfully.
- spark_executor_failure_count: Total number of Spark Executors which failed.
- spark_executor_running_count: Total number of Spark Executors which are currently running.
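The histogram variant of the start-latency metric uses the jobStartLatencyBuckets configured earlier, so percentiles can be estimated with histogram_quantile. This is a sketch; the exact bucket series name may differ by operator version, so confirm it on the query page first:

```promql
# Approximate p90 application start latency over the last 10 minutes
histogram_quantile(0.9,
  sum(rate(spark_application_start_latency_seconds_bucket[10m])) by (le))
```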
Work Queue Metrics
- workqueue_depth: Current depth of workqueue
- workqueue_adds_total: Total number of adds handled by workqueue
- workqueue_queue_duration_seconds_bucket: How long in seconds an item stays in workqueue before being requested
- workqueue_work_duration_seconds_bucket: How long in seconds processing an item from workqueue takes
- workqueue_retries_total: Total number of retries handled by workqueue
- workqueue_unfinished_work_seconds: Unfinished work in seconds
- workqueue_longest_running_processor_seconds: Longest running processor in seconds
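The workqueue histograms can be queried the same way, for example to watch how long items wait in the queue before being processed:

```promql
# p99 time an item spends waiting in the workqueue (last 5 minutes)
histogram_quantile(0.99,
  sum(rate(workqueue_queue_duration_seconds_bucket[5m])) by (le))

# Current queue depth
workqueue_depth
```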