Our team maintains over a hundred SSG (Static Site Generator) based projects, mostly product documentation, API references, and internal knowledge bases. Initially we exposed these services through the standard Nginx Ingress Controller. As the number of sites grew, the pain points of this approach became increasingly obvious: every time an Ingress rule was added or modified, the Nginx Ingress Controller had to regenerate the complete Nginx configuration and perform a reload. Under high-frequency changes this not only introduced perceptible latency but also carried the risk of a failed configuration hot reload. We needed a more dynamic, more fine-grained routing layer, but introducing a full service mesh like Istio would have been overkill for this scenario.
Our goal was clear: a lightweight, high-performance, dynamically configurable edge proxy for these static sites. Envoy Proxy was the obvious choice; its core strength is dynamic configuration through the xDS (Discovery Service) APIs. But that requires an xDS server, i.e. a control plane. Rather than adopting an existing, complex solution, we decided to build a dedicated, highly customized control plane ourselves. It would do exactly one thing and do it well: translate Kubernetes-native resource objects into dynamic Envoy configuration.
We chose Go to build this tool because of its ecosystem advantages in the cloud-native space: a strong concurrency model, solid performance, and client-go, the mature library for interacting with the Kubernetes API.
Step One: Defining Our API - the StaticSite CRD
One of the core ideas of platform engineering is to give developers a declarative, minimal interface. We don't want developers to care what an Envoy Cluster, Route, or Listener is. They should only have to declare: "I have a site, its domain is A, and the backing Kubernetes Service is B."
To that end, we used kubebuilder to create a Custom Resource Definition (CRD) named StaticSite.
Here is the core definition in api/v1alpha1/staticsite_types.go:
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Domain",type=string,JSONPath=`.spec.domain`
// +kubebuilder:printcolumn:name="Service",type=string,JSONPath=`.spec.backend.serviceName`
// +kubebuilder:printcolumn:name="Port",type=integer,JSONPath=`.spec.backend.servicePort`
// +kubebuilder:printcolumn:name="Status",type=string,JSONPath=`.status.phase`
// StaticSite is the Schema for the staticsites API
type StaticSite struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec StaticSiteSpec `json:"spec,omitempty"`
Status StaticSiteStatus `json:"status,omitempty"`
}
// StaticSiteSpec defines the desired state of StaticSite
type StaticSiteSpec struct {
// Domain is the fully qualified domain name for the static site.
// This will be used for host matching in Envoy.
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=4
Domain string `json:"domain"`
// Backend defines the Kubernetes Service that serves the static content.
// +kubebuilder:validation:Required
Backend BackendService `json:"backend"`
}
// BackendService defines the service details for the backend.
type BackendService struct {
// ServiceName is the name of the Kubernetes Service.
// +kubebuilder:validation:Required
ServiceName string `json:"serviceName"`
// ServicePort is the port number of the Kubernetes Service.
// +kubebuilder:validation:Required
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
ServicePort int32 `json:"servicePort"`
}
// StaticSiteStatus defines the observed state of StaticSite
type StaticSiteStatus struct {
// Phase indicates the current status of the resource.
// e.g., "Provisioned", "Error"
Phase string `json:"phase,omitempty"`
// Message provides more details about the current status.
Message string `json:"message,omitempty"`
}
// +kubebuilder:object:root=true
// StaticSiteList contains a list of StaticSite
type StaticSiteList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []StaticSite `json:"items"`
}
func init() {
SchemeBuilder.Register(&StaticSite{}, &StaticSiteList{})
}
A developer only needs to submit a YAML manifest like this:
apiVersion: platform.my-company.com/v1alpha1
kind: StaticSite
metadata:
name: docs-main
namespace: docs
spec:
domain: "docs.my-company.com"
backend:
serviceName: "docs-main-svc"
servicePort: 80
Our control plane is responsible for watching the creation, update, and deletion of these StaticSite resources and translating them into Envoy configuration.
Core Component: Implementing the xDS Control Plane
The architecture of the control plane consists of two main parts:
- Kubernetes Informer: uses client-go to watch changes to StaticSite resources in real time.
- xDS Server: uses the go-control-plane library to build the gRPC service that serves configuration to Envoy.
graph TD
    A[Developers Commit StaticSite YAML] --> B{Git Repository};
    B --> C[ArgoCD / GitOps Controller];
    C --> D[Creates/Updates StaticSite CR in EKS];
    subgraph EKS Cluster
        subgraph Control Plane Pod
            E[Go Control Plane]
        end
        subgraph Envoy Pod
            F[Envoy Proxy]
        end
        subgraph Application Pods
            G[SSG App Pods]
        end
        D -- Watches --> E;
        E -- Pushes xDS Config --> F;
    end
    F -- Routes Traffic --> G;
    H[End User] -- HTTPS Request --> F;
1. Project Structure and Dependencies
Our go.mod file contains the following key dependencies:
go 1.21
require (
github.com/envoyproxy/go-control-plane v0.11.1
google.golang.org/grpc v1.58.2
k8s.io/api v0.28.2
k8s.io/apimachinery v0.28.2
k8s.io/client-go v0.28.2
// ... other kubebuilder and controller-runtime dependencies
)
2. xDS Snapshot Cache
The heart of go-control-plane is the SnapshotCache: a thread-safe cache that holds a "snapshot" of the complete configuration Envoy needs. When the configuration changes, we don't patch individual fragments in the cache; we build an entirely new snapshot and swap it in for the old one. This guarantees configuration consistency.
Our controller needs a callback that fires whenever a StaticSite resource changes and regenerates the snapshot.
// internal/xds/cache.go
package xds
import (
"context"
"fmt"
"sync"
"sync/atomic"
"github.com/envoyproxy/go-control-plane/pkg/cache/types"
"github.com/envoyproxy/go-control-plane/pkg/cache/v3"
"github.com/envoyproxy/go-control-plane/pkg/resource/v3"
log "github.com/sirupsen/logrus"
platformv1alpha1 "my-company.com/staticsite-operator/api/v1alpha1"
)
// XdsCache is a wrapper around the go-control-plane snapshot cache.
type XdsCache struct {
snapshotCache cache.SnapshotCache
version atomic.Int64
mutex sync.Mutex
nodeID string
}
// NewXdsCache creates a new instance of our xDS cache.
func NewXdsCache(nodeID string) *XdsCache {
return &XdsCache{
snapshotCache: cache.NewSnapshotCache(false, cache.IDHash{}, nil),
nodeID: nodeID,
}
}
// GenerateSnapshot creates a new snapshot of xDS resources from a list of StaticSites.
func (c *XdsCache) GenerateSnapshot(sites []platformv1alpha1.StaticSite) error {
c.mutex.Lock()
defer c.mutex.Unlock()
// Increment version for the new snapshot.
// Versions must be unique and monotonically increasing.
newVersion := c.version.Add(1)
versionStr := fmt.Sprintf("%d", newVersion)
log.Infof("Generating new snapshot version %s for %d sites", versionStr, len(sites))
clusters := []types.Resource{}
routes := []types.Resource{}
for _, site := range sites {
// For each StaticSite resource, we generate a corresponding
// Envoy Cluster and a Route entry.
clusterName := fmt.Sprintf("%s_%s_cluster", site.Namespace, site.Spec.Backend.ServiceName)
routeName := fmt.Sprintf("%s_%s_route", site.Namespace, site.Name)
// Create an Envoy Cluster resource.
// This tells Envoy how to connect to the backend Kubernetes Service.
cls := makeCluster(clusterName, site.Spec.Backend.ServiceName, site.Namespace, uint32(site.Spec.Backend.ServicePort))
clusters = append(clusters, cls)
// Create an Envoy Route configuration part for this site.
// This is part of a larger VirtualHost configuration.
route := makeRoute(routeName, site.Spec.Domain, clusterName)
routes = append(routes, route)
}
// The listener is static. It listens on port 8080 and refers to the RouteConfiguration
// which we will dynamically update.
const listenerName = "http_listener"
const routeConfigName = "dynamic_routes"
listener := makeHTTPListener(listenerName, routeConfigName)
// A RouteConfiguration contains all the virtual hosts and routes.
routeConfig := makeRouteConfig(routeConfigName, routes)
snapshot, err := cache.NewSnapshot(
versionStr,
map[resource.Type][]types.Resource{
resource.ClusterType: clusters,
resource.RouteType: {routeConfig},
resource.ListenerType: {listener},
// EDS (Endpoint Discovery Service) is handled by Envoy's DNS resolver
// for 'STRICT_DNS' cluster type, so we don't need to provide Endpoints here.
// This simplifies our control plane significantly.
},
)
if err != nil {
log.Errorf("Failed to create snapshot version %s: %v", versionStr, err)
return err
}
if err := snapshot.Consistent(); err != nil {
log.Errorf("Snapshot version %s is not consistent: %v", versionStr, err)
return err
}
// Set the new snapshot for our node ID.
// All connected Envoy proxies with this node ID will receive this update.
if err := c.snapshotCache.SetSnapshot(context.Background(), c.nodeID, snapshot); err != nil {
log.Errorf("Failed to set snapshot version %s: %v", versionStr, err)
return err
}
log.Infof("Successfully set snapshot version %s", versionStr)
return nil
}
3. Generating Envoy Resources
The makeCluster, makeRoute, and makeHTTPListener functions contain the core translation logic. They use the protobuf structs provided by go-control-plane to build the Envoy configuration.
A key decision is the Cluster's service discovery type. We chose STRICT_DNS, which means Envoy itself resolves the DNS name of the Kubernetes Service (e.g. docs-main-svc.docs.svc.cluster.local) to reach the backend. For a standard ClusterIP Service this resolves to the Service's virtual IP and kube-proxy balances across the Pods; a headless Service would resolve directly to Pod IPs. Either way, it greatly simplifies our control plane: we don't need to implement EDS (Endpoint Discovery Service) to track the list of Pod IPs behind every Service. This is a pragmatic trade-off for our scenario, where the backend Services are relatively stable.
// internal/xds/resources.go
package xds

import (
	"fmt"
	"time"

	cluster "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpoint "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	listener "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
	route "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	hcm "github.com/envoyproxy/go-control-plane/envoy/extensions/filters/network/http_connection_manager/v3"
	"github.com/envoyproxy/go-control-plane/pkg/wellknown"
	"google.golang.org/protobuf/types/known/anypb"
	"google.golang.org/protobuf/types/known/durationpb"
	// ... other imports
)
// makeCluster creates a STRICT_DNS cluster.
func makeCluster(clusterName, serviceName, serviceNamespace string, servicePort uint32) *cluster.Cluster {
return &cluster.Cluster{
Name: clusterName,
ConnectTimeout: durationpb.New(5 * time.Second),
ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_STRICT_DNS},
// In a Kubernetes environment, a service can be resolved via DNS.
// e.g., my-service.my-namespace.svc.cluster.local
LoadAssignment: &endpoint.ClusterLoadAssignment{
ClusterName: clusterName,
Endpoints: []*endpoint.LocalityLbEndpoints{
{
LbEndpoints: []*endpoint.LbEndpoint{
{
HostIdentifier: &endpoint.LbEndpoint_Endpoint{
Endpoint: &endpoint.Endpoint{
Address: &core.Address{
Address: &core.Address_SocketAddress{
SocketAddress: &core.SocketAddress{
Protocol: core.SocketAddress_TCP,
Address: fmt.Sprintf("%s.%s.svc.cluster.local", serviceName, serviceNamespace),
PortSpecifier: &core.SocketAddress_PortValue{PortValue: servicePort},
},
},
},
},
},
},
},
},
},
},
}
}
// makeRoute creates a route for a specific domain pointing to a cluster.
func makeRoute(routeName, domain, clusterName string) *route.VirtualHost {
return &route.VirtualHost{
Name: routeName,
Domains: []string{domain},
Routes: []*route.Route{
{
Match: &route.RouteMatch{
PathSpecifier: &route.RouteMatch_Prefix{
Prefix: "/",
},
},
Action: &route.Route_Route{
Route: &route.RouteAction{
ClusterSpecifier: &route.RouteAction_Cluster{
Cluster: clusterName,
},
},
},
},
},
}
}
// makeHTTPListener creates a basic HTTP listener.
func makeHTTPListener(listenerName, routeConfigName string) *listener.Listener {
// ... implementation to create a listener with an HttpConnectionManager filter
// that points to our dynamic RouteConfiguration via RDS.
}
// makeRouteConfig aggregates virtual hosts into a single RouteConfiguration.
func makeRouteConfig(configName string, virtualHosts []*route.VirtualHost) *route.RouteConfiguration {
return &route.RouteConfiguration{
Name: configName,
VirtualHosts: virtualHosts,
}
}
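The body of makeHTTPListener is elided above. For completeness, here is a minimal sketch of what it could look like with go-control-plane v0.11.x, under a few assumptions: the listener binds 0.0.0.0:8080 (to match the container port used later), the stat prefix name is ours, and one extra import is needed, router "github.com/envoyproxy/go-control-plane/envoy/extensions/filters/http/router/v3", since recent Envoy versions require an explicit typed config for the router filter.

// makeConfigSource tells Envoy to fetch the referenced resource over ADS.
func makeConfigSource() *core.ConfigSource {
	return &core.ConfigSource{
		ResourceApiVersion:    core.ApiVersion_V3,
		ConfigSourceSpecifier: &core.ConfigSource_Ads{Ads: &core.AggregatedConfigSource{}},
	}
}

// makeHTTPListener builds a listener on 0.0.0.0:8080 whose HttpConnectionManager
// pulls its RouteConfiguration (routeConfigName) dynamically via RDS.
func makeHTTPListener(listenerName, routeConfigName string) *listener.Listener {
	routerCfg, _ := anypb.New(&router.Router{})
	manager := &hcm.HttpConnectionManager{
		CodecType:  hcm.HttpConnectionManager_AUTO,
		StatPrefix: "ssg_gateway",
		RouteSpecifier: &hcm.HttpConnectionManager_Rds{
			Rds: &hcm.Rds{
				ConfigSource:    makeConfigSource(),
				RouteConfigName: routeConfigName,
			},
		},
		HttpFilters: []*hcm.HttpFilter{{
			Name:       wellknown.Router,
			ConfigType: &hcm.HttpFilter_TypedConfig{TypedConfig: routerCfg},
		}},
	}
	managerAny, err := anypb.New(manager)
	if err != nil {
		panic(err) // programming error: the manager proto should always marshal
	}
	return &listener.Listener{
		Name: listenerName,
		Address: &core.Address{
			Address: &core.Address_SocketAddress{
				SocketAddress: &core.SocketAddress{
					Protocol:      core.SocketAddress_TCP,
					Address:       "0.0.0.0",
					PortSpecifier: &core.SocketAddress_PortValue{PortValue: 8080},
				},
			},
		},
		FilterChains: []*listener.FilterChain{{
			Filters: []*listener.Filter{{
				Name:       wellknown.HTTPConnectionManager,
				ConfigType: &listener.Filter_TypedConfig{TypedConfig: managerAny},
			}},
		}},
	}
}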
4. Kubernetes Controller Logic
Now we need to connect the Kubernetes event stream to our xDS cache. The controller-runtime library makes this relatively straightforward. We create a Reconciler whose Reconcile method is invoked whenever a StaticSite resource changes.
// internal/controller/staticsite_controller.go
package controller
import (
"context"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
platformv1alpha1 "my-company.com/staticsite-operator/api/v1alpha1"
"my-company.com/staticsite-operator/internal/xds"
)
type StaticSiteReconciler struct {
client.Client
Scheme *runtime.Scheme
XdsCache *xds.XdsCache // Our xDS cache instance
}
// Reconcile is the main loop. It's triggered by changes to StaticSite resources.
func (r *StaticSiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
logger.Info("Reconciling StaticSite resource")
// The core logic: whenever any StaticSite resource changes, we fetch ALL
// StaticSite resources in the cluster and regenerate the ENTIRE snapshot.
// This is an "aggregate" reconciliation pattern. It's simpler to implement
// and ensures consistency, at the cost of being less efficient for very large
// numbers of resources. For our scale (hundreds of sites), this is acceptable.
var siteList platformv1alpha1.StaticSiteList
if err := r.List(ctx, &siteList); err != nil {
logger.Error(err, "unable to fetch StaticSite list")
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// This is the trigger point. We pass the full list of current sites
// to our cache to generate and publish a new snapshot.
if err := r.XdsCache.GenerateSnapshot(siteList.Items); err != nil {
logger.Error(err, "failed to generate and set xDS snapshot")
// We can optionally update the status of all resources to "Error" here.
return ctrl.Result{}, err
}
logger.Info("Successfully reconciled and updated xDS snapshot")
return ctrl.Result{}, nil
}
// SetupWithManager sets up the controller with the Manager.
func (r *StaticSiteReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&platformv1alpha1.StaticSite{}).
Complete(r)
}
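What the snippets above don't show is how the controller manager and the xDS gRPC server are wired together at startup. The sketch below is roughly what our main.go does, with a few assumptions: XdsCache.SnapshotCache() is an assumed one-line getter exposing the underlying cache.SnapshotCache, and error handling, flags, and metrics are trimmed. Since the Envoy bootstrap (next section) uses ADS, registering the aggregated discovery service is the essential part; registering the individual xDS services as well is harmless.

// cmd/main.go (sketch: error handling, flags, and metrics trimmed)
package main

import (
	"fmt"
	"net"

	clusterservice "github.com/envoyproxy/go-control-plane/envoy/service/cluster/v3"
	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	listenerservice "github.com/envoyproxy/go-control-plane/envoy/service/listener/v3"
	routeservice "github.com/envoyproxy/go-control-plane/envoy/service/route/v3"
	xdsserver "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
	ctrl "sigs.k8s.io/controller-runtime"

	platformv1alpha1 "my-company.com/staticsite-operator/api/v1alpha1"
	"my-company.com/staticsite-operator/internal/controller"
	"my-company.com/staticsite-operator/internal/xds"
)

func main() {
	ctx := ctrl.SetupSignalHandler()

	// The node ID must match node.id in the Envoy bootstrap.
	xdsCache := xds.NewXdsCache("ssg-gateway")

	// Serve xDS over gRPC on :9000, backed by the snapshot cache.
	// SnapshotCache() is an assumed getter returning the embedded cache.SnapshotCache.
	srv := xdsserver.NewServer(ctx, xdsCache.SnapshotCache(), nil)
	grpcServer := grpc.NewServer()
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)
	clusterservice.RegisterClusterDiscoveryServiceServer(grpcServer, srv)
	routeservice.RegisterRouteDiscoveryServiceServer(grpcServer, srv)
	listenerservice.RegisterListenerDiscoveryServiceServer(grpcServer, srv)
	lis, err := net.Listen("tcp", fmt.Sprintf(":%d", 9000))
	if err != nil {
		panic(err)
	}
	go func() { _ = grpcServer.Serve(lis) }()

	// Run the controller-runtime manager hosting the StaticSite reconciler.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = platformv1alpha1.AddToScheme(mgr.GetScheme())
	_ = (&controller.StaticSiteReconciler{
		Client:   mgr.GetClient(),
		Scheme:   mgr.GetScheme(),
		XdsCache: xdsCache,
	}).SetupWithManager(mgr)
	_ = mgr.Start(ctx)
}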
Deployment and Integration
Envoy Bootstrap Configuration
At startup, each Envoy instance (our data plane) needs a static bootstrap configuration file that tells it where to find its control plane.
# envoy-bootstrap.yaml
node:
  # This ID must match the nodeID used by the control plane cache.
  id: "ssg-gateway"
  cluster: "ssg-gateway-cluster"
dynamic_resources:
  lds_config:
    ads: {}
  cds_config:
    ads: {}
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        # References the static xds_cluster defined below, which points at our Go control plane.
        cluster_name: xds_cluster
    set_node_on_first_message_only: true
# This cluster definition tells Envoy how to connect to our control plane.
# It is the ONLY static configuration Envoy needs.
static_resources:
  clusters:
  - name: xds_cluster
    type: STRICT_DNS
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
    # xDS is served over gRPC, so the upstream must speak HTTP/2.
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                # The control plane is exposed via a service named 'staticsite-operator-xds-svc' on port 9000
                address: staticsite-operator-xds-svc.operators.svc.cluster.local
                port_value: 9000
We bake this configuration file into the Envoy container image, or mount it via a ConfigMap.
EKS Deployment Manifests
Finally, we deploy the control plane and the Envoy gateway to the EKS cluster.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staticsite-operator
  namespace: operators
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: staticsite-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: staticsite-operator
    spec:
      serviceAccountName: staticsite-operator-sa
      containers:
      - name: operator
        image: <your-repo>/staticsite-operator:latest
        command: ["/manager"]
        args: ["--leader-elect"]
---
apiVersion: v1
kind: Service
metadata:
  name: staticsite-operator-xds-svc
  namespace: operators
spec:
  type: ClusterIP
  ports:
  - name: grpc-xds
    port: 9000
    targetPort: 9000
  selector:
    app.kubernetes.io/name: staticsite-operator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ssg-envoy-gateway
  namespace: ingress
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: ssg-envoy-gateway
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ssg-envoy-gateway
    spec:
      containers:
      - name: envoy
        image: envoyproxy/envoy:v1.27.0
        args:
        - "-c"
        - "/etc/envoy/envoy-bootstrap.yaml"
        # Add other args for logging, concurrency etc.
        ports:
        - containerPort: 8080 # This matches the port in our dynamic listener
        volumeMounts:
        - name: envoy-config
          mountPath: /etc/envoy
      volumes:
      - name: envoy-config
        configMap:
          name: ssg-envoy-bootstrap-cm
With this approach we built an efficient, focused platform component. Developers deploy and manage static sites in the Kubernetes-native way they already know (submitting YAML), while the underlying routing complexity is completely hidden. When a StaticSite resource is created or modified, our Go control plane computes the new configuration snapshot almost instantly and pushes it to all Envoy proxies over xDS. No reload is involved and existing traffic is unaffected, which solves the performance and stability bottlenecks we started with.
Limitations and Future Paths
This design is not without drawbacks. Currently all routing configuration is aggregated into a single RouteConfiguration. Once the number of sites reaches the thousands, regenerating and transmitting the full configuration for every small change will put pressure on both the control plane and Envoy's memory. A clear optimization path is to move to incremental xDS (Delta xDS), which lets the control plane send only the parts that changed, significantly reducing the overhead of configuration updates.
In addition, the control plane is currently a single point: Kubernetes will restart the Deployment if it crashes, but configuration changes cannot be served during the restart window. Availability could be improved with an active-standby setup based on leader election.
Finally, the StaticSite CRD can be enriched further, for example with automatic TLS certificate management (integrating cert-manager), simple rewrite rules, or custom headers, exposing more of Envoy's capabilities to developers through the declarative API.
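As a purely hypothetical illustration (none of this exists in the current CRD), the TLS extension could start as nothing more than an optional field on the spec that references a certificate Secret issued by cert-manager:

// Hypothetical addition to the StaticSite API; not part of the current CRD.
type StaticSiteTLS struct {
	// SecretName references a kubernetes.io/tls Secret in the same namespace,
	// e.g. one issued by cert-manager for spec.domain.
	// +kubebuilder:validation:Required
	SecretName string `json:"secretName"`
}

// In StaticSiteSpec:
//   // TLS, when set, enables HTTPS for this site.
//   // +optional
//   TLS *StaticSiteTLS `json:"tls,omitempty"`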