Building a Dedicated Go Envoy xDS Control Plane for Multi-Tenant SSG Deployments on AWS EKS


Our team maintains more than a hundred SSG (Static Site Generator) based projects, mostly product documentation, API references, and internal knowledge bases. Initially we exposed these services through the standard Nginx Ingress Controller. As the number of sites grew, the pain points of that approach became increasingly obvious: every time an Ingress rule was added or modified, the Nginx Ingress Controller had to regenerate the complete Nginx configuration and perform a reload. With high-frequency changes this not only introduced perceptible latency but also carried the risk of failed hot reloads. We needed a more dynamic, fine-grained routing layer, but pulling in a full service mesh like Istio would have been overkill for this use case.

Our goal was clear: a lightweight, high-performance, dynamically configurable edge proxy for these static sites. Envoy Proxy was the obvious choice; its core strength is dynamic configuration through the xDS (Discovery Service) APIs. That, however, requires an xDS server, i.e. a control plane. Rather than adopting an existing and more complex solution, we decided to build a dedicated, highly customized control plane ourselves. It does exactly one thing and does it well: translating native Kubernetes resources into dynamic Envoy configuration.

We chose Go to build this tool because of its ecosystem advantages in the cloud-native space: a strong concurrency model, good performance, and client-go, the mature library for interacting with the Kubernetes API.

Step 1: Defining Our API, the StaticSite CRD

One of the core ideas of platform engineering is to give developers a declarative, minimal interface. We do not want developers to care about what an Envoy Cluster, Route, or Listener is. They should only have to declare: "I have a site, its domain is A, and its backend Kubernetes Service is B."

For this we used kubebuilder to create a Custom Resource Definition (CRD) named StaticSite.

Here is the core definition in api/v1alpha1/staticsite_types.go:

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Domain",type=string,JSONPath=`.spec.domain`
// +kubebuilder:printcolumn:name="Service",type=string,JSONPath=`.spec.backend.serviceName`
// +kubebuilder:printcolumn:name="Port",type=integer,JSONPath=`.spec.backend.servicePort`
// +kubebuilder:printcolumn:name="Status",type=string,JSONPath=`.status.phase`

// StaticSite is the Schema for the staticsites API
type StaticSite struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   StaticSiteSpec   `json:"spec,omitempty"`
	Status StaticSiteStatus `json:"status,omitempty"`
}

// StaticSiteSpec defines the desired state of StaticSite
type StaticSiteSpec struct {
	// Domain is the fully qualified domain name for the static site.
	// This will be used for host matching in Envoy.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=4
	Domain string `json:"domain"`

	// Backend defines the Kubernetes Service that serves the static content.
	// +kubebuilder:validation:Required
	Backend BackendService `json:"backend"`
}

// BackendService defines the service details for the backend.
type BackendService struct {
	// ServiceName is the name of the Kubernetes Service.
	// +kubebuilder:validation:Required
	ServiceName string `json:"serviceName"`

	// ServicePort is the port number of the Kubernetes Service.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=65535
	ServicePort int32 `json:"servicePort"`
}

// StaticSiteStatus defines the observed state of StaticSite
type StaticSiteStatus struct {
	// Phase indicates the current status of the resource.
	// e.g., "Provisioned", "Error"
	Phase string `json:"phase,omitempty"`
	// Message provides more details about the current status.
	Message string `json:"message,omitempty"`
}

// +kubebuilder:object:root=true

// StaticSiteList contains a list of StaticSite
type StaticSiteList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []StaticSite `json:"items"`
}

func init() {
	SchemeBuilder.Register(&StaticSite{}, &StaticSiteList{})
}

A developer only needs to submit a YAML manifest like this:

apiVersion: platform.my-company.com/v1alpha1
kind: StaticSite
metadata:
  name: docs-main
  namespace: docs
spec:
  domain: "docs.my-company.com"
  backend:
    serviceName: "docs-main-svc"
    servicePort: 80

Our control plane watches the creation, update, and deletion of these StaticSite resources and translates them into Envoy configuration.

Core Components: Implementing the xDS Control Plane

The control plane's architecture consists of two main parts:

  1. Kubernetes informer: uses client-go to watch changes to StaticSite resources in real time.
  2. xDS server: uses the go-control-plane library to build the gRPC service that serves configuration to Envoy.

graph TD
    A[Developers Commit StaticSite YAML] --> B{Git Repository};
    B --> C[ArgoCD / GitOps Controller];
    C --> D[Creates/Updates StaticSite CR in EKS];
    subgraph EKS Cluster
        subgraph Control Plane Pod
            E[Go Control Plane]
        end
        subgraph Envoy Pod
            F[Envoy Proxy]
        end
        subgraph Application Pods
            G[SSG App Pods]
        end
        D -- Watches --> E;
        E -- Pushes xDS Config --> F;
    end
    F -- Routes Traffic --> G;
    H[End User] -- HTTPS Request --> F;

1. Project Structure and Dependencies

Our go.mod file contains the following key dependencies:

module my-company.com/staticsite-operator

go 1.21

require (
	github.com/envoyproxy/go-control-plane v0.11.1
	google.golang.org/grpc v1.58.2
	k8s.io/api v0.28.2
	k8s.io/apimachinery v0.28.2
	k8s.io/client-go v0.28.2
	// ... other kubebuilder and controller-runtime dependencies
)

2. xDS Snapshot Cache

The heart of go-control-plane is the SnapshotCache. It is a thread-safe cache that holds a "snapshot" of the complete configuration Envoy needs. When the configuration changes, we do not patch individual fragments of the cache; we build an entirely new snapshot and swap it in for the old one. This guarantees configuration consistency.

Our controller needs a callback that fires whenever a StaticSite resource changes, so that the snapshot can be regenerated.

// internal/xds/cache.go

package xds

import (
    "context"
    "fmt"
    "sync"
    "sync/atomic"

    route "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
    "github.com/envoyproxy/go-control-plane/pkg/cache/types"
    "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
    "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
    log "github.com/sirupsen/logrus"

    platformv1alpha1 "my-company.com/staticsite-operator/api/v1alpha1"
)

// XdsCache is a wrapper around the go-control-plane snapshot cache.
type XdsCache struct {
    snapshotCache cache.SnapshotCache
    version       atomic.Int64
    mutex         sync.Mutex
    nodeID        string
}

// NewXdsCache creates a new instance of our xDS cache.
func NewXdsCache(nodeID string) *XdsCache {
    return &XdsCache{
        snapshotCache: cache.NewSnapshotCache(false, cache.IDHash{}, nil),
        nodeID:        nodeID,
    }
}

// GenerateSnapshot creates a new snapshot of xDS resources from a list of StaticSites.
func (c *XdsCache) GenerateSnapshot(sites []platformv1alpha1.StaticSite) error {
    c.mutex.Lock()
    defer c.mutex.Unlock()

    // Increment version for the new snapshot.
    // Versions must be unique and monotonically increasing.
    newVersion := c.version.Add(1)
    versionStr := fmt.Sprintf("%d", newVersion)

    log.Infof("Generating new snapshot version %s for %d sites", versionStr, len(sites))

    clusters := []types.Resource{}
    virtualHosts := []*route.VirtualHost{}

    for _, site := range sites {
        // For each StaticSite resource, we generate a corresponding
        // Envoy Cluster and a Route entry.
        clusterName := fmt.Sprintf("%s_%s_cluster", site.Namespace, site.Spec.Backend.ServiceName)
        routeName := fmt.Sprintf("%s_%s_route", site.Namespace, site.Name)
        
        // Create an Envoy Cluster resource.
        // This tells Envoy how to connect to the backend Kubernetes Service.
        cls := makeCluster(clusterName, site.Spec.Backend.ServiceName, site.Namespace, uint32(site.Spec.Backend.ServicePort))
        clusters = append(clusters, cls)
        
        // Create an Envoy VirtualHost for this site. All virtual hosts are
        // aggregated into a single RouteConfiguration below.
        vh := makeRoute(routeName, site.Spec.Domain, clusterName)
        virtualHosts = append(virtualHosts, vh)
    }
    
    // The listener is static. It listens on port 8080 and refers to the RouteConfiguration
    // which we will dynamically update.
    const listenerName = "http_listener"
    const routeConfigName = "dynamic_routes"

    listener := makeHTTPListener(listenerName, routeConfigName)

    // A RouteConfiguration contains all the virtual hosts and routes.
    routeConfig := makeRouteConfig(routeConfigName, virtualHosts)

    snapshot, err := cache.NewSnapshot(
        versionStr,
        map[resource.Type][]types.Resource{
            resource.ClusterType:  clusters,
            resource.RouteType:    {routeConfig},
            resource.ListenerType: {listener},
            // EDS (Endpoint Discovery Service) is handled by Envoy's DNS resolver 
            // for 'STRICT_DNS' cluster type, so we don't need to provide Endpoints here.
            // This simplifies our control plane significantly.
        },
    )

    if err != nil {
        log.Errorf("Failed to create snapshot version %s: %v", versionStr, err)
        return err
    }

    if err := snapshot.Consistent(); err != nil {
        log.Errorf("Snapshot version %s is not consistent: %v", versionStr, err)
        return err
    }

    // Set the new snapshot for our node ID.
    // All connected Envoy proxies with this node ID will receive this update.
    if err := c.snapshotCache.SetSnapshot(context.Background(), c.nodeID, snapshot); err != nil {
        log.Errorf("Failed to set snapshot version %s: %v", versionStr, err)
        return err
    }
    
    log.Infof("Successfully set snapshot version %s", versionStr)
    return nil
}
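
The cache by itself does not serve anything yet; Envoy still needs a gRPC endpoint to stream from. Below is a minimal sketch of that piece, assuming we add a RunServer method in the same package (the method and file names are illustrative, not code shown elsewhere in this post; only the port, 9000, is fixed by the bootstrap file and the Service manifest later on).

// internal/xds/server.go (sketch)

package xds

import (
    "context"
    "net"

    discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
    serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
    "google.golang.org/grpc"
)

// RunServer serves the snapshot cache to Envoy over ADS until ctx is cancelled.
func (c *XdsCache) RunServer(ctx context.Context, addr string) error {
    // The go-control-plane server turns the snapshot cache into xDS stream
    // handlers; every SetSnapshot call is pushed to the connected proxies.
    srv := serverv3.NewServer(ctx, c.snapshotCache, nil)

    grpcServer := grpc.NewServer()
    // Our Envoy bootstrap uses ADS only, so the aggregated service is sufficient.
    discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

    lis, err := net.Listen("tcp", addr)
    if err != nil {
        return err
    }
    go func() {
        <-ctx.Done()
        grpcServer.GracefulStop()
    }()
    return grpcServer.Serve(lis)
}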

3. Generating Envoy Resources

The makeCluster, makeRoute, and makeHTTPListener functions contain the core translation logic. They use the protobuf types provided by go-control-plane to construct the Envoy configuration.

One key decision is the cluster's service discovery type. We chose STRICT_DNS, which means Envoy itself resolves the DNS name of the Kubernetes Service (for example docs-main-svc.docs.svc.cluster.local) and sends traffic to whatever that name resolves to (the Service's ClusterIP, or the individual Pod IPs for a headless Service). This greatly simplifies the control plane: we do not need to implement EDS (Endpoint Discovery Service) to track the list of Pod IPs behind each Service. It is a pragmatic trade-off for our scenario, where the backend Services are relatively stable.

// internal/xds/resources.go

package xds

import (
	"fmt"
	"time"

	cluster "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpoint "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	listener "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
	route "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	routerv3 "github.com/envoyproxy/go-control-plane/envoy/extensions/filters/http/router/v3"
	hcm "github.com/envoyproxy/go-control-plane/envoy/extensions/filters/network/http_connection_manager/v3"
	"github.com/envoyproxy/go-control-plane/pkg/wellknown"
	"google.golang.org/protobuf/types/known/anypb"
	"google.golang.org/protobuf/types/known/durationpb"
)

// makeCluster creates a STRICT_DNS cluster.
func makeCluster(clusterName, serviceName, serviceNamespace string, servicePort uint32) *cluster.Cluster {
	return &cluster.Cluster{
		Name:                 clusterName,
		ConnectTimeout:       durationpb.New(5 * time.Second),
		ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_STRICT_DNS},
		// In a Kubernetes environment, a service can be resolved via DNS.
		// e.g., my-service.my-namespace.svc.cluster.local
		LoadAssignment: &endpoint.ClusterLoadAssignment{
			ClusterName: clusterName,
			Endpoints: []*endpoint.LocalityLbEndpoints{
				{
					LbEndpoints: []*endpoint.LbEndpoint{
						{
							HostIdentifier: &endpoint.LbEndpoint_Endpoint{
								Endpoint: &endpoint.Endpoint{
									Address: &core.Address{
										Address: &core.Address_SocketAddress{
											SocketAddress: &core.SocketAddress{
												Protocol:      core.SocketAddress_TCP,
												Address:       fmt.Sprintf("%s.%s.svc.cluster.local", serviceName, serviceNamespace),
												PortSpecifier: &core.SocketAddress_PortValue{PortValue: servicePort},
											},
										},
									},
								},
							},
						},
					},
				},
			},
		},
	}
}

// makeRoute creates a route for a specific domain pointing to a cluster.
func makeRoute(routeName, domain, clusterName string) *route.VirtualHost {
    return &route.VirtualHost{
        Name:    routeName,
        Domains: []string{domain},
        Routes: []*route.Route{
            {
                Match: &route.RouteMatch{
                    PathSpecifier: &route.RouteMatch_Prefix{
                        Prefix: "/",
                    },
                },
                Action: &route.Route_Route{
                    Route: &route.RouteAction{
                        ClusterSpecifier: &route.RouteAction_Cluster{
                            Cluster: clusterName,
                        },
                    },
                },
            },
        },
    }
}

// makeHTTPListener creates the HTTP listener on port 8080. Its
// HttpConnectionManager fetches the RouteConfiguration named routeConfigName
// via RDS over the same ADS stream.
func makeHTTPListener(listenerName, routeConfigName string) *listener.Listener {
	// Recent Envoy versions expect a typed config on the router filter.
	routerCfg, err := anypb.New(&routerv3.Router{})
	if err != nil {
		panic(err)
	}

	manager := &hcm.HttpConnectionManager{
		CodecType:  hcm.HttpConnectionManager_AUTO,
		StatPrefix: "ingress_http",
		RouteSpecifier: &hcm.HttpConnectionManager_Rds{
			Rds: &hcm.Rds{
				RouteConfigName: routeConfigName,
				ConfigSource: &core.ConfigSource{
					ResourceApiVersion: core.ApiVersion_V3,
					ConfigSourceSpecifier: &core.ConfigSource_Ads{
						Ads: &core.AggregatedConfigSource{},
					},
				},
			},
		},
		HttpFilters: []*hcm.HttpFilter{{
			Name:       wellknown.Router,
			ConfigType: &hcm.HttpFilter_TypedConfig{TypedConfig: routerCfg},
		}},
	}

	managerAny, err := anypb.New(manager)
	if err != nil {
		panic(err)
	}

	return &listener.Listener{
		Name: listenerName,
		Address: &core.Address{
			Address: &core.Address_SocketAddress{
				SocketAddress: &core.SocketAddress{
					Protocol:      core.SocketAddress_TCP,
					Address:       "0.0.0.0",
					PortSpecifier: &core.SocketAddress_PortValue{PortValue: 8080},
				},
			},
		},
		FilterChains: []*listener.FilterChain{{
			Filters: []*listener.Filter{{
				Name:       wellknown.HTTPConnectionManager,
				ConfigType: &listener.Filter_TypedConfig{TypedConfig: managerAny},
			}},
		}},
	}
}

// makeRouteConfig aggregates virtual hosts into a single RouteConfiguration.
func makeRouteConfig(configName string, virtualHosts []*route.VirtualHost) *route.RouteConfiguration {
    return &route.RouteConfiguration{
        Name:         configName,
        VirtualHosts: virtualHosts,
    }
}
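
As a sanity check on the translation logic, a small same-package test can feed a single StaticSite through GenerateSnapshot and assert on the resulting snapshot. This is a sketch we would add, not code from the sections above; the file name and the specific assertions are illustrative.

// internal/xds/cache_test.go (sketch)

package xds

import (
    "testing"

    "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    platformv1alpha1 "my-company.com/staticsite-operator/api/v1alpha1"
)

func TestGenerateSnapshot(t *testing.T) {
    c := NewXdsCache("ssg-gateway")

    sites := []platformv1alpha1.StaticSite{{
        ObjectMeta: metav1.ObjectMeta{Name: "docs-main", Namespace: "docs"},
        Spec: platformv1alpha1.StaticSiteSpec{
            Domain: "docs.my-company.com",
            Backend: platformv1alpha1.BackendService{
                ServiceName: "docs-main-svc",
                ServicePort: 80,
            },
        },
    }}

    if err := c.GenerateSnapshot(sites); err != nil {
        t.Fatalf("GenerateSnapshot: %v", err)
    }

    // The snapshot for our node ID should contain one cluster, one
    // RouteConfiguration and one listener.
    snap, err := c.snapshotCache.GetSnapshot("ssg-gateway")
    if err != nil {
        t.Fatalf("no snapshot for node: %v", err)
    }
    if n := len(snap.GetResources(resource.ClusterType)); n != 1 {
        t.Errorf("expected 1 cluster, got %d", n)
    }
    if n := len(snap.GetResources(resource.ListenerType)); n != 1 {
        t.Errorf("expected 1 listener, got %d", n)
    }
}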

4. Kubernetes Controller Logic

Now we need to connect the Kubernetes event stream to our xDS cache. The controller-runtime library makes this relatively straightforward. We create a Reconciler whose Reconcile method is invoked whenever a StaticSite resource changes.

// internal/controller/staticsite_controller.go

package controller

import (
	"context"
	
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	platformv1alpha1 "my-company.com/staticsite-operator/api/v1alpha1"
	"my-company.com/staticsite-operator/internal/xds"
)

type StaticSiteReconciler struct {
	client.Client
	Scheme   *runtime.Scheme
	XdsCache *xds.XdsCache // Our xDS cache instance
}

// Reconcile is the main loop. It's triggered by changes to StaticSite resources.
func (r *StaticSiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)
	logger.Info("Reconciling StaticSite resource")

	// The core logic: whenever any StaticSite resource changes, we fetch ALL
	// StaticSite resources in the cluster and regenerate the ENTIRE snapshot.
	// This is an "aggregate" reconciliation pattern. It's simpler to implement
	// and ensures consistency, at the cost of being less efficient for very large
	// numbers of resources. For our scale (hundreds of sites), this is acceptable.

	var siteList platformv1alpha1.StaticSiteList
	if err := r.List(ctx, &siteList); err != nil {
		logger.Error(err, "unable to fetch StaticSite list")
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
    
    // This is the trigger point. We pass the full list of current sites
    // to our cache to generate and publish a new snapshot.
	if err := r.XdsCache.GenerateSnapshot(siteList.Items); err != nil {
        logger.Error(err, "failed to generate and set xDS snapshot")
        // We can optionally update the status of all resources to "Error" here.
		return ctrl.Result{}, err
	}

	logger.Info("Successfully reconciled and updated xDS snapshot")
	return ctrl.Result{}, nil
}

// SetupWithManager sets up the controller with the Manager.
func (r *StaticSiteReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&platformv1alpha1.StaticSite{}).
		Complete(r)
}
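
For completeness, the process entry point needs to register the reconciler with the manager and start the xDS server next to it. The sketch below shows one way this could be wired up; it assumes the RunServer method from the earlier sketch and the AddToScheme helper that kubebuilder normally generates, and it hardcodes leader election instead of parsing the --leader-elect flag used in the Deployment manifest.

// cmd/main.go (sketch)

package main

import (
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"

	platformv1alpha1 "my-company.com/staticsite-operator/api/v1alpha1"
	"my-company.com/staticsite-operator/internal/controller"
	"my-company.com/staticsite-operator/internal/xds"
)

func main() {
	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)
	_ = platformv1alpha1.AddToScheme(scheme)

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:           scheme,
		LeaderElection:   true,
		LeaderElectionID: "staticsite-operator.platform.my-company.com",
	})
	if err != nil {
		os.Exit(1)
	}

	// The node ID must match node.id in the Envoy bootstrap file.
	xdsCache := xds.NewXdsCache("ssg-gateway")

	if err := (&controller.StaticSiteReconciler{
		Client:   mgr.GetClient(),
		Scheme:   mgr.GetScheme(),
		XdsCache: xdsCache,
	}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}

	ctx := ctrl.SetupSignalHandler()

	// Serve xDS on :9000 in the background; the manager blocks in the foreground.
	go func() {
		if err := xdsCache.RunServer(ctx, ":9000"); err != nil {
			os.Exit(1)
		}
	}()

	if err := mgr.Start(ctx); err != nil {
		os.Exit(1)
	}
}

Note that with leader election and more than one replica, only the leader's reconciler would populate its local cache, so the xDS Service would have to target the leader; with the single operator replica in the manifest below this is not a concern.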

Deployment and Integration

Envoy Bootstrap Configuration

The Envoy instances (our data plane) need a static bootstrap configuration file at startup that tells them where to find their control plane.

# envoy-bootstrap.yaml
node:
  # This ID must match the nodeID used by the control plane cache.
  id: "ssg-gateway"
  cluster: "ssg-gateway-cluster"

dynamic_resources:
  lds_config:
    ads: {}
  cds_config:
    ads: {}
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          # This is the Kubernetes Service name for our Go control plane.
          cluster_name: xds_cluster
    set_node_on_first_message_only: true

# This cluster definition tells Envoy how to connect to our control plane.
# It is the ONLY static configuration Envoy needs.
static_resources:
  clusters:
  - name: xds_cluster
    type: STRICT_DNS
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
    # gRPC requires HTTP/2; use the typed protocol options instead of the
    # deprecated cluster-level http2_protocol_options field.
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                # The control plane is exposed via a service named 'staticsite-operator-xds-svc' on port 9000
                address: staticsite-operator-xds-svc.operators.svc.cluster.local
                port_value: 9000

We either bake this file into the Envoy container image or mount it via a ConfigMap.

EKS Deployment Manifests

Finally, we deploy the control plane and the Envoy gateway to the EKS cluster.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staticsite-operator
  namespace: operators
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: staticsite-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: staticsite-operator
    spec:
      serviceAccountName: staticsite-operator-sa
      containers:
      - name: operator
        image: <your-repo>/staticsite-operator:latest
        command: ["/manager"]
        args: ["--leader-elect"]
---
apiVersion: v1
kind: Service
metadata:
  name: staticsite-operator-xds-svc
  namespace: operators
spec:
  type: ClusterIP
  ports:
  - name: grpc-xds
    port: 9000
    targetPort: 9000
  selector:
    app.kubernetes.io/name: staticsite-operator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ssg-envoy-gateway
  namespace: ingress
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ssg-envoy-gateway
  template:
    metadata:
      labels:
        app: ssg-envoy-gateway
    spec:
      containers:
      - name: envoy
        image: envoyproxy/envoy:v1.27.0
        args:
        - "-c"
        - "/etc/envoy/envoy-bootstrap.yaml"
        # Add other args for logging, concurrency etc.
        ports:
        - containerPort: 8080 # This matches the port in our dynamic listener
        volumeMounts:
        - name: envoy-config
          mountPath: /etc/envoy
      volumes:
      - name: envoy-config
        configMap:
          name: ssg-envoy-bootstrap-cm

With this in place we have an efficient, focused platform component. Developers deploy and manage static sites the Kubernetes-native way they already know (submitting YAML), while the underlying routing complexity stays completely hidden from them. When a StaticSite resource is created or modified, our Go control plane computes a new configuration snapshot almost instantly and pushes it to all Envoy proxies via xDS. There are no reloads and no impact on existing traffic, which resolves the performance and stability bottleneck we started with.

Limitations and Future Directions

This design is not without drawbacks. It currently aggregates all routing configuration into a single RouteConfiguration. Once the number of sites reaches the thousands, regenerating and transmitting the complete configuration for every small change will put pressure on the memory of both the control plane and Envoy. A clear optimization path is to move to incremental xDS (Delta xDS), which lets the control plane send only the parts that changed and significantly lowers the cost of configuration updates.

In addition, the control plane is currently a single instance. The Kubernetes Deployment restarts it if it crashes, but configuration changes cannot be processed during that restart window. Availability can be improved by running multiple replicas in an active-standby setup with leader election.

Finally, the StaticSite CRD can be extended further, for example with automatic TLS certificate management (integrating cert-manager), simple rewrite rules, or custom headers, exposing more of Envoy's capabilities to developers through the declarative API.

