Installing spark-operator and collecting its metrics with Prometheus

by ShrimpTaco 2025. 4. 11.

Installing the spark-operator helm chart

# Add the spark-operator helm repository
❯ helm repo add spark-operator https://kubeflow.github.io/spark-operator

# Install the spark-operator helm chart
❯ helm install spark-operator spark-operator/spark-operator \
    --namespace spark-operator \
    --create-namespace \
    --set webhook.enable=true \
    --set spark.jobNamespaces={spark-jobs}

# Verify the installation
❯ kubectl get all -n spark-operator
NAME                                             READY   STATUS    RESTARTS   AGE
pod/spark-operator-controller-598d475647-mknvd   1/1     Running   0          6m56s
pod/spark-operator-webhook-565b79b589-d6cfk      1/1     Running   0          6m56s

NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/spark-operator-webhook-svc   ClusterIP   10.102.148.40   <none>        9443/TCP   6m56s

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/spark-operator-controller   1/1     1            1           6m56s
deployment.apps/spark-operator-webhook      1/1     1            1           6m56s

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/spark-operator-controller-598d475647   1         1         1       6m56s
replicaset.apps/spark-operator-webhook-565b79b589      1         1         1       6m56s
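
As an extra sanity check, you can also look at the custom resource definitions registered by the chart and the values actually applied to the release. This is just a quick sketch; the exact CRD names may vary between chart versions.

# Confirm the SparkApplication CRDs registered by the chart
❯ kubectl get crd | grep sparkoperator

# Review the values applied to the release
❯ helm get values spark-operator -n spark-operator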

Checking the spark-operator metrics endpoint

When monitoring with Prometheus you would normally create a ServiceMonitor, but spark-operator appears to use a PodMonitor instead. According to a GitHub issue, enabling metrics does not deploy a Service for them, and indeed the only Service deployed above is the webhook one.
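
This can be verified directly. A quick sketch, using the same pod label that the PodMonitor further below selects on:

# Only the webhook Service exists; the metrics port is exposed on the controller pod itself
❯ kubectl get svc -n spark-operator
❯ kubectl get pod -n spark-operator -l app.kubernetes.io/name=spark-operator \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].ports}{"\n"}{end}'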

The default Prometheus metrics values defined in the spark-operator values.yaml are shown below. The chart can also create the PodMonitor itself, but here it will be created from the Prometheus side instead, so prometheus.podMonitor.create stays at false.

prometheus:
  metrics:
    # -- Specifies whether to enable prometheus metrics scraping.
    enable: true
    # -- Metrics port.
    port: 8080
    # -- Metrics port name.
    portName: metrics
    # -- Metrics serving endpoint.
    endpoint: /metrics
    # -- Metrics prefix, will be added to all exported metrics.
    prefix: ""
    # -- Job Start Latency histogram buckets. Specified in seconds.
    jobStartLatencyBuckets: "30,60,90,120,150,180,210,240,270,300"

  # Prometheus pod monitor for controller pods
  podMonitor:
    # -- Specifies whether to create pod monitor.
    # Note that prometheus metrics should be enabled as well.
    create: false
    # -- Pod monitor labels
    labels: {}
    # -- The label to use to retrieve the job name from
    jobLabel: spark-operator-podmonitor
    # -- Prometheus metrics endpoint properties. `metrics.portName` will be used as a port
    podMetricsEndpoint:
      scheme: http
      interval: 5s
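
Before wiring Prometheus up, the endpoint itself can be sanity-checked. A minimal sketch, assuming the default port 8080 and /metrics path above, and the controller deployment name from the earlier output:

# Port-forward the controller and hit the metrics endpoint directly
❯ kubectl port-forward deployment/spark-operator-controller 8080:8080 -n spark-operator &
❯ curl -s localhost:8080/metrics | head -n 20

# Stop the background port-forward when done
❯ kill %1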

Configuring Prometheus

When installing the Prometheus helm chart, add the spark-operator PodMonitor settings to the values.

# custom-values.yaml
prometheus:
  additionalPodMonitors:
    - name: spark-operator
      namespaceSelector:
        matchNames:
          - spark-operator
      selector:
        matchLabels:
          app.kubernetes.io/name: spark-operator
      podMetricsEndpoints:
        - port: metrics
          path: /metrics
          interval: 5s

Pass the custom-values.yaml file written above when installing the Prometheus helm chart.

# Add the repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the helm chart
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f custom-values.yaml

# Verify the helm chart installation
❯ kubectl get all -n monitoring
NAME                                                         READY   STATUS    RESTARTS   AGE
pod/alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          4m14s
pod/prometheus-grafana-995866d9b-pv9r7                       3/3     Running   0          4m15s
pod/prometheus-kube-prometheus-operator-565cbd8649-plm2c     1/1     Running   0          4m15s
pod/prometheus-kube-state-metrics-849c746cf5-nkjjb           1/1     Running   0          4m15s
pod/prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          4m14s
pod/prometheus-prometheus-node-exporter-kqdws                1/1     Running   0          4m15s

NAME                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   4m14s
service/prometheus-grafana                        ClusterIP   10.106.190.0     <none>        80/TCP                       4m15s
service/prometheus-kube-prometheus-alertmanager   ClusterIP   10.106.227.167   <none>        9093/TCP,8080/TCP            4m15s
service/prometheus-kube-prometheus-operator       ClusterIP   10.98.152.26     <none>        443/TCP                      4m15s
service/prometheus-kube-prometheus-prometheus     ClusterIP   10.105.251.35    <none>        9090/TCP,8080/TCP            4m15s
service/prometheus-kube-state-metrics             ClusterIP   10.97.95.107     <none>        8080/TCP                     4m15s
service/prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     4m14s
service/prometheus-prometheus-node-exporter       ClusterIP   10.101.172.93    <none>        9100/TCP                     4m15s

NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/prometheus-prometheus-node-exporter   1         1         1       1            1           kubernetes.io/os=linux   4m15s

NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-grafana                    1/1     1            1           4m15s
deployment.apps/prometheus-kube-prometheus-operator   1/1     1            1           4m15s
deployment.apps/prometheus-kube-state-metrics         1/1     1            1           4m15s

NAME                                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-grafana-995866d9b                     1         1         1       4m15s
replicaset.apps/prometheus-kube-prometheus-operator-565cbd8649   1         1         1       4m15s
replicaset.apps/prometheus-kube-state-metrics-849c746cf5         1         1         1       4m15s

NAME                                                                    READY   AGE
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager   1/1     4m14s
statefulset.apps/prometheus-prometheus-kube-prometheus-prometheus       1/1     4m14s

# Check the created PodMonitor
❯ kubectl get podmonitor -A
NAMESPACE    NAME             AGE
monitoring   spark-operator   7m31s
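
If the PodMonitor needs to be managed outside of the helm values (for example when Prometheus is already installed), an equivalent standalone manifest can be applied directly. This is only a sketch; it assumes the default kube-prometheus-stack selector behaviour, which only picks up PodMonitors labelled with the helm release name (release: prometheus here).

❯ kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spark-operator
  namespace: monitoring
  labels:
    release: prometheus  # assumed to match the default Prometheus podMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - spark-operator
  selector:
    matchLabels:
      app.kubernetes.io/name: spark-operator
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 5s
EOF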

Checking spark-operator metrics in the Prometheus web UI

Setting up an ingress is a hassle, so just port-forward and check via localhost. After running the command below, open localhost:9090 in a web browser and confirm that the metrics are being collected properly.

# prometheus web port-forwarding
❯ kubectl port-forward pods/prometheus-prometheus-kube-prometheus-prometheus-0 9090:9090 -n monitoring

At localhost:9090/targets, confirm that the spark-operator target is up and healthy.

At localhost:9090/query, the metrics can be queried.
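
The same check can be done from the terminal through the Prometheus HTTP API. A small sketch, assuming the port-forward above is still running and using one of the metric names listed in the next section:

# Instant query for one of the spark-operator metrics
❯ curl -s 'http://localhost:9090/api/v1/query?query=spark_application_count'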

Metrics exposed by spark-operator

https://www.kubeflow.org/docs/components/spark-operator/getting-started/#spark-application-metrics

Spark Application Metrics

  • spark_application_count: Total number of SparkApplication handled by the Operator.
  • spark_application_submit_count: Total number of SparkApplication spark-submitted by the Operator.
  • spark_application_success_count: Total number of SparkApplication which completed successfully.
  • spark_application_failure_count: Total number of SparkApplication which failed to complete.
  • spark_application_running_count: Total number of SparkApplication which are currently running.
  • spark_application_success_execution_time_seconds: Execution time for applications which succeeded.
  • spark_application_failure_execution_time_seconds: Execution time for applications which failed.
  • spark_application_start_latency_seconds: Start latency of SparkApplication as type of Prometheus Summary.
  • spark_application_start_latency_seconds: Start latency of SparkApplication as type of Prometheus Histogram.
  • spark_executor_success_count: Total number of Spark Executors which completed successfully.
  • spark_executor_failure_count: Total number of Spark Executors which failed.
  • spark_executor_running_count: Total number of Spark Executors which are currently running.

Work Queue Metrics

  • workqueue_depth: Current depth of workqueue
  • workqueue_adds_total: Total number of adds handled by workqueue
  • workqueue_queue_duration_seconds_bucket: How long in seconds an item stays in workqueue before being requested
  • workqueue_work_duration_seconds_bucket: How long in seconds processing an item from workqueue takes
  • workqueue_retries_total: Total number of retries handled by workqueue
  • workqueue_unfinished_work_seconds: Unfinished work in seconds
  • workqueue_longest_running_processor_seconds: Longest running processor in seconds
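
As a usage sketch for the histogram metrics above (again assuming the earlier port-forward is running), PromQL such as histogram_quantile can summarise the bucket series, e.g. the p99 time an item waits in the workqueue:

# p99 workqueue wait time over the last 5 minutes, via the Prometheus HTTP API
❯ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket[5m])) by (le))'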
