Probe for spring boot


=issue=
<pre>
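If the log shows:

```text
Readiness probe failed:
Get "http://10.6.6.1:8080/actuator/health":
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

**By Kubernetes design, a failing Readiness Probe alone does not restart the container.** It only takes the Pod out of the Service endpoints.

So first confirm what actually happened:

```bash
kubectl describe pod <pod-name>
```

Check whether the Events section also contains:

```text
Liveness probe failed
```

or:

```text
Killing container
```

or whether the container's last state reports:

```text
OOMKilled
```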

---

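# Case 1: the restart is actually caused by a Liveness failure

A configuration you see often:

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080

readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
```

When Spring Boot is affected by any of:

* a slow database
* slow Redis
* GC pauses

the endpoint:

```text
/actuator/health
```

takes longer than the probe timeout to respond.

Because both probes share that endpoint, both fail:

```text
Readiness Fail
Liveness Fail
```

Once the Liveness Probe reaches its failure threshold, the result is:

```text
Pod Restart
```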

---

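# Case 2: the timeout is too small

For example:

```yaml
readinessProbe:
  timeoutSeconds: 1
```

`timeoutSeconds` also defaults to 1 second when it is not set. If the Spring Boot Actuator health check queries the database and takes:

```text
2 to 3 seconds
```

the probe reports:

```text
context deadline exceeded
```

Suggested value:

```yaml
readinessProbe:
  timeoutSeconds: 5
```

or even:

```yaml
readinessProbe:
  timeoutSeconds: 10
```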
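
The timeout works together with the probe period and the failure threshold. A minimal sketch of a fuller readiness probe is shown here. The path and port come from the examples on this page. The period and threshold values are illustrative assumptions, not tuned recommendations.

```yaml
# Sketch only: period and threshold values are assumptions for illustration.
readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  timeoutSeconds: 5      # a request slower than 5 s counts as one failure
  periodSeconds: 10      # probe once every 10 s
  failureThreshold: 3    # marked not ready after 3 consecutive failures
  successThreshold: 1    # a single success marks the Pod ready again
```

With these values a single slow response does not remove the Pod from the Service. It takes roughly 30 seconds of consecutive failures.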

---

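# Case 3: the database check slows down the Health Check

By default, Spring Boot's:

```text
/actuator/health
```

aggregates the health indicators of the dependencies the application has configured. Depending on the application, that can include:

* DB
* Redis
* RabbitMQ
* Kafka
* Elasticsearch

If the database stalls occasionally:

```text
Health Endpoint
 ↓
waits for DB
 ↓
timeout
 ↓
Probe Failed
```

---

To check, call the endpoint from inside the container:

```bash
curl localhost:8080/actuator/health
```

or, with timing:

```bash
time curl localhost:8080/actuator/health
```

and see whether the response is slow.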
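
To see which dependency is behind a slow or failing response, it can help to have the endpoint list each component separately. The sketch below uses the standard `show-details` property. Enabling it is an assumption made for debugging, and you may want to restrict it in production.

```yaml
# application.yml
# Sketch: exposing details is a debugging aid, not a production recommendation.
management:
  endpoint:
    health:
      show-details: always   # list each component (db, redis, ...) with its own status
```

With details enabled, the `curl` output shows a `components` section, so a `DOWN` database is visible directly instead of only an aggregated status.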

---

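# Prefer the Spring Boot probe endpoints

Enable them:

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true
```

Then point each probe at its dedicated path:

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
```

With this setup:

### Liveness

Reports only the application's internal liveness state, that is, whether the JVM process is alive. It does not depend on external services.

### Readiness

Reports whether the application can accept traffic. By default this group contains only the application's own readiness state. Checks for business dependencies such as the database have to be added to the group explicitly.

This avoids:

```text
DB down
↓
Liveness fails
↓
Pod restarts endlessly
```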
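
If readiness should also reflect business dependencies, those indicators can be added to the readiness health group. This is a minimal sketch that assumes the application has `db` and `redis` health indicators. Adjust the names to the indicators that actually exist in your application.

```yaml
# application.yml
# Sketch: the db and redis indicator names are assumptions about this application.
management:
  endpoint:
    health:
      probes:
        enabled: true
      group:
        readiness:
          include: readinessState,db,redis   # readiness also fails when DB or Redis is down
        liveness:
          include: livenessState             # liveness stays independent of external services
```

Keeping external checks out of the liveness group is the design choice that prevents a database outage from triggering restarts.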

---

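# Case 4: the JVM is stalling

If you often see:

```text
context deadline exceeded
```

but the application log shows no errors, check resource usage:

```bash
kubectl top pod
```

and look for:

```text
CPU at 100%
Memory close to the limit
```

Also check the container log:

```bash
kubectl logs <pod-name>
```

for entries such as:

```text
Full GC
OutOfMemoryError
```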

---

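# Case 5: startup is too slow

If the failure appears right after startup:

```text
Readiness probe failed
```

likely causes are:

* Flyway migration
* Hibernate initialization
* cache warm-up

In that case add a startup probe:

```yaml
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 60
  periodSeconds: 5
```

This allows up to 60 × 5 s = 300 s:

```text
5 minutes of startup time
```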

---

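# As a DevOps engineer, I would run these commands first

```bash
kubectl describe pod <pod>
```

Look for:

```text
Liveness probe failed?
OOMKilled?
Exit Code?
```

```bash
kubectl get pod <pod> -o yaml | grep -i -A20 probe
```

Shows the current probe configuration. The `-i` flag is needed because the field names are `livenessProbe`, `readinessProbe` and `startupProbe`, with a capital `P`.

```bash
kubectl logs <pod> --previous
```

Shows the log of the previous container instance, which usually reveals why it exited.

```bash
kubectl top pod <pod>
```

Shows whether resources are exhausted.

---

If you post the output of:

```bash
kubectl describe pod <pod>

kubectl get deploy xxx -o yaml

kubectl logs <pod> --previous
```

I can usually tell which of the following caused the restart:

* probe misconfiguration
* a Spring Boot health check problem
* OOM
* GC pauses
* slow database connections
* slow startup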

</pre>
=3=
<pre>
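For **Spring Boot running on Kubernetes**, the currently recommended approach is:

* **Startup Probe**: handles slow application startup
* **Readiness Probe**: decides whether the Pod receives traffic
* **Liveness Probe**: decides whether the container needs a restart

Do not configure all three probes identically.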

---

## 1. Enable Actuator in Spring Boot

Usually in `application.yml`:

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health
  endpoint:
    health:
      probes:
        enabled: true
```

With this setting, Spring Boot 2.3+ exposes:

```text
/actuator/health
/actuator/health/liveness
/actuator/health/readiness
```

---

## 2. Recommended probe configuration

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
```

---

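## 3. What each probe is responsible for

### Startup Probe

Covers:

```text
Slow Spring Boot startup
JPA initialization
Flyway migration
Cache warm-up
```

For example, if:

```text
startup takes 3 minutes
```

then:

```yaml
failureThreshold: 30
periodSeconds: 10
```

means 30 × 10 s = 300 s:

```text
up to 5 minutes allowed for startup
```

Until the startup probe succeeds:

```text
Liveness is not run
Readiness is not run
```

which avoids:

```text
startup has not finished yet
and K8s already kills the container
```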

---

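### Readiness Probe

Determines:

```text
whether the Pod can receive business traffic
```

For example:

* the database connection is healthy
* Redis is healthy
* Kafka is healthy

When it passes:

```text
Ready=True
```

and only then does the Service forward traffic to the Pod.

When it fails:

```text
Ready=False
```

the Pod is not restarted.

It is only:

```text
removed from the Service endpoints
```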

---

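### Liveness Probe

Determines:

```text
whether the JVM is hung
```

For example:

* deadlock
* a stuck thread pool
* long GC pauses

On failure, after:

```text
3 consecutive failures
```

the kubelet:

```text
restarts the container
```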

---

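## 4. Spring Boot best practice

Do not use:

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080

readinessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
```

Because when:

```text
the database is down
```

Spring Boot by default responds with HTTP 503 and:

```json
{
  "status":"DOWN"
}
```

The result:

```text
Liveness fails
Readiness fails
```

The container is restarted again and again, although restarting it cannot fix the database.

The Pod ends up in:

```text
CrashLoopBackOff
```

---

Recommended:

```text
Liveness
    only checks whether the JVM is alive

Readiness
    checks the database
    Redis
    MQ
```

That is:

```text
/actuator/health/liveness
/actuator/health/readiness
```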

---

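## 5. Configuration commonly used at large companies

For example:

```yaml
startupProbe:
  failureThreshold: 60
  periodSeconds: 5

readinessProbe:
  periodSeconds: 10

livenessProbe:
  periodSeconds: 30
```

The reasoning:

```text
Readiness is checked frequently
Liveness is checked conservatively
```

This keeps:

```text
a short GC pause
a network blip
```

from causing an unnecessary restart.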
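
The fragment above lists only the periods. A fuller sketch of the same idea is shown here. The paths and port follow the earlier examples on this page. The timeouts and failure thresholds are illustrative assumptions, chosen to show how the tolerance window is derived.

```yaml
# Sketch: timeouts and thresholds are assumptions for illustration.
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  periodSeconds: 5
  failureThreshold: 60   # up to 60 x 5 s = 300 s for startup

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3    # taken out of the endpoints after about 30 s of failures

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3    # restarted only after about 90 s of failures
```

The tolerance of each probe is roughly `periodSeconds` times `failureThreshold`, so traffic is removed quickly while a restart needs a much longer outage.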

---

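## 6. AWS EKS scenario

If the Pods sit behind:

* ALB
* NLB
* Ingress Nginx

it is advisable to line up the checks:

```text
ALB Health Check
          ↓
Readiness Probe
          ↓
Spring Boot Readiness
```

so that they form one health-check chain.

For example:

```text
ALB
 ↓
Ingress
 ↓
Service
 ↓
Pod Readiness
 ↓
DB
```

When the database has a problem:

```text
Readiness=False
```

the Pod is taken out of rotation automatically,

instead of:

```text
the Pod being restarted over and over
```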
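
One way to line up the ALB health check with the readiness endpoint is through Ingress annotations. The sketch below assumes the AWS Load Balancer Controller is installed. The Ingress name, Service name and port are hypothetical placeholders.

```yaml
# Sketch: assumes the AWS Load Balancer Controller; names and port are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: spring-app
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/target-type: ip
    # Point the ALB target group health check at the same endpoint as the readiness probe.
    alb.ingress.kubernetes.io/healthcheck-path: /actuator/health/readiness
    alb.ingress.kubernetes.io/healthcheck-port: traffic-port
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: spring-app
                port:
                  number: 8080
```

Using the same path for both checks means the load balancer and Kubernetes agree on when a Pod should stop receiving traffic.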

---

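If a DevOps/SRE interviewer asks:

> How do you design Liveness and Readiness for Spring Boot?

a standard answer is:

> Liveness only checks whether the application process and the JVM are alive, and does not depend on external services. Readiness checks whether the application is able to serve requests, including the state of its database, cache and message queue connections. For Spring Boot applications that start slowly, a Startup Probe is added so that Kubernetes does not misjudge the application during startup and restart it.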

</pre>
[[category:k8s]]