大规模Agent部署与资源管理

概述与动机

随着Agent应用在企业级场景中的广泛采用，如何高效部署和管理大规模Agent系统成为关键技术挑战。单机部署已经无法满足现代应用的需求，我们需要采用云原生架构、容器化部署、自动化编排等技术来支撑成百上千个Agent实例的运行。大规模部署不仅要考虑技术可行性，还要关注成本效益、运维效率、系统可靠性等多个维度。

大规模Agent部署面临着独特的挑战。Agent系统通常涉及复杂的组件依赖，包括大语言模型、向量数据库、缓存系统、监控系统等。这些组件需要协同工作，每个组件的可用性和性能都直接影响整个系统的质量。此外，Agent系统的资源需求变化较大，根据任务复杂度和并发量的不同，计算资源、内存资源、存储资源的需求可能在短时间内发生剧烈变化。

资源管理是大规模部署的核心。如何在保证服务质量的前提下，最大化资源利用率，降低运营成本，是每个运维团队必须解决的问题。资源管理包括资源分配、调度优化、容量规划、成本控制等多个方面。有效的资源管理能够显著提升系统性能，降低运营成本，提高用户满意度。

本文将深入探讨大规模Agent系统的部署和资源管理技术，从架构设计到具体实现，提供完整的解决方案和最佳实践，帮助读者构建高效、经济、可靠的大规模Agent系统。

核心概念与架构设计

大规模部署架构

大规模Agent部署需要采用层次化的架构设计，如下图所示：

Rendering diagram...

基础设施层：提供基础的计算、存储、网络资源，包括云计算平台、容器编排系统、存储系统、网络架构等。基础设施层为上层应用提供稳定、可扩展的基础支撑。

平台层：提供Agent运行所需的各种服务，包括Agent运行时、模型服务、数据处理、缓存系统等。平台层封装了底层复杂性，为应用层提供简化的服务接口。

应用层：实现具体的业务逻辑，包括业务Agent、工具Agent、协调Agent、监控Agent等。应用层是用户直接交互的层次，决定了系统的业务价值。

管理层：提供运维管理功能，包括自动扩展、负载均衡、监控告警、配置管理等。管理层确保系统的稳定运行和高效运维。

水平扩展策略

水平扩展是通过增加实例数量来提升系统性能和容量的主要策略：

无状态服务扩展：无状态服务最容易扩展，可以随意增加或减少实例。Agent服务应该尽量设计为无状态，将状态存储在外部系统中。无状态扩展策略简单高效，是水平扩展的首选。

有状态服务扩展：有状态服务的扩展需要考虑数据一致性和状态同步。常用的策略包括数据分片、状态复制、读写分离等。有状态扩展复杂度高，需要精心设计。

微服务架构：将系统拆分为多个独立的微服务，每个服务可以独立扩展。微服务架构提供了灵活的扩展能力，但也增加了系统复杂性。需要合理设计服务边界和通信机制。

多区域部署：在不同地理区域部署服务实例，提升全球用户访问体验。多区域部署需要考虑数据同步、延迟优化、容灾备份等问题。

资源调度算法

资源调度决定了系统资源的高效利用，主要包括：

基于权重的调度：根据服务的重要性分配资源权重，权重高的服务获得更多资源。权重调度确保重要服务的优先性，但可能导致资源分配不均。

基于需求的调度：根据服务的实际资源需求动态分配资源。需求调度提高资源利用率，但需要准确的资源需求预测。

基于时间的调度：根据历史流量模式，在不同时间段分配不同数量的资源。时间调度优化成本效益，但需要准确的模式识别。

智能调度：使用机器学习预测资源需求，实现智能调度决策。智能调度提供了最优的资源分配，但需要大量的历史数据和模型训练。

自动扩展机制

自动扩展是应对流量变化的核心机制：

水平扩展：通过增加服务实例数量来应对流量增长。水平扩展需要考虑启动时间、健康检查、流量分配等因素。是最常用的扩展方式。

垂直扩展：通过增加单个实例的资源容量来提升性能。垂直扩展相对简单，但受限于硬件资源，且存在单点故障风险。

预测性扩展：基于历史数据和预测模型，提前扩展资源应对预期流量增长。预测性扩展提供了更好的用户体验，但需要准确的预测模型。

事件驱动扩展：基于特定事件触发扩展，如新功能发布、营销活动等。事件驱动扩展提供了精确的控制，但需要准确的事件识别和响应机制。

关键技术实现

Kubernetes Operator实现

下面是一个完整的Kubernetes Operator实现，用于管理Agent系统的部署和扩展：

import asyncio
import logging
from typing import Dict, List, Optional, Any
from dataclasses import dataclass, field
from enum import Enum
import json
import time

try:
    from kubernetes import client, config, watch
    from kubernetes.client.rest import ApiException
    KUBERNETES_AVAILABLE = True
except ImportError:
    KUBERNETES_AVAILABLE = False

# 设置日志
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class AgentState(Enum):
    """Agent状态"""
    PENDING = "pending"
    RUNNING = "running"
    SCALING = "scaling"
    ERROR = "error"
    STOPPED = "stopped"

@dataclass
class AgentSpec:
    """Agent规格"""
    name: str
    image: str
    replicas: int = 1
    resources: Dict[str, str] = field(default_factory=lambda: {
        'cpu': '500m',
        'memory': '512Mi'
    })
    env_vars: Dict[str, str] = field(default_factory=dict)
    ports: List[int] = field(default_factory=lambda: [8080])
    model_config: Optional[Dict[str, Any]] = None

@dataclass
class AgentStatus:
    """Agent状态"""
    state: AgentState = AgentState.PENDING
    current_replicas: int = 0
    desired_replicas: int = 1
    ready_replicas: int = 0
    message: str = ""
    last_updated: float = field(default_factory=time.time)

class KubernetesManager:
    """Kubernetes管理器"""
    
    def __init__(self, kubeconfig: Optional[str] = None, namespace: str = "default"):
        """
        初始化Kubernetes管理器
        
        Args:
            kubeconfig: Kubernetes配置文件路径
            namespace: 命名空间
        """
        if not KUBERNETES_AVAILABLE:
            raise ImportError("kubernetes package is required")
        
        self.namespace = namespace
        
        try:
            # 加载Kubernetes配置
            if kubeconfig:
                config.load_kube_config(config_file=kubeconfig)
            else:
                # 尝试集群内配置
                config.load_incluster_config()
            
            # 初始化API客户端
            self.v1 = client.CoreV1Api()
            self.apps_v1 = client.AppsV1Api()
            self.custom_api = client.CustomObjectsApi()
            
            logger.info(f"Connected to Kubernetes cluster, namespace: {namespace}")
            
        except Exception as e:
            logger.error(f"Failed to connect to Kubernetes: {e}")
            raise
    
    def create_deployment(
        self,
        name: str,
        image: str,
        replicas: int = 1,
        resources: Optional[Dict[str, str]] = None,
        env_vars: Optional[Dict[str, str]] = None,
        ports: Optional[List[int]] = None
    ) -> Dict[str, Any]:
        """
        创建Deployment
        
        Args:
            name: 部署名称
            image: 镜像名称
            replicas: 副本数量
            resources: 资源配置
            env_vars: 环境变量
            ports: 端口列表
            
        Returns:
            创建的Deployment信息
        """
        # 构建容器规格
        container_spec = client.V1Container(
            name=name,
            image=image,
            resources=client.V1ResourceRequirements(
                requests=resources or {
                    'cpu': '500m',
                    'memory': '512Mi'
                },
                limits=resources or {
                    'cpu': '1000m',
                    'memory': '1Gi'
                }
            ),
            env=[
                client.V1EnvVar(name=k, value=v)
                for k, v in (env_vars or {}).items()
            ],
            ports=[
                client.V1ContainerPort(container_port=port)
                for port in (ports or [8080])
            ],
            liveness_probe=client.V1Probe(
                http_get=client.V1HTTPGetAction(
                    path="/health",
                    port=8080
                ),
                initial_delay_seconds=30,
                period_seconds=10
            ),
            readiness_probe=client.V1Probe(
                http_get=client.V1HTTPGetAction(
                    path="/ready",
                    port=8080
                ),
                initial_delay_seconds=10,
                period_seconds=5
            )
        )
        
        # 构建Pod模板
        pod_template = client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(
                labels={"app": name}
            ),
            spec=client.V1PodSpec(
                containers=[container_spec]
            )
        )
        
        # 构建Deployment规格
        deployment_spec = client.V1DeploymentSpec(
            replicas=replicas,
            selector=client.V1LabelSelector(
                match_labels={"app": name}
            ),
            template=pod_template
        )
        
        # 构建Deployment对象
        deployment = client.V1Deployment(
            api_version="apps/v1",
            kind="Deployment",
            metadata=client.V1ObjectMeta(
                name=name,
                namespace=self.namespace
            ),
            spec=deployment_spec
        )
        
        try:
            # 创建Deployment
            result = self.apps_v1.create_namespaced_deployment(
                namespace=self.namespace,
                body=deployment
            )
            
            logger.info(f"Created deployment: {name}")
            return {
                'name': name,
                'uid': result.metadata.uid,
                'replicas': replicas,
                'status': 'created'
            }
            
        except ApiException as e:
            logger.error(f"Failed to create deployment {name}: {e}")
            raise
    
    def scale_deployment(self, name: str, replicas: int) -> Dict[str, Any]:
        """
        扩展Deployment
        
        Args:
            name: 部署名称
            replicas: 目标副本数量
            
        Returns:
            扩展结果
        """
        try:
            # 获取当前Deployment
            deployment = self.apps_v1.read_namespaced_deployment(
                name=name,
                namespace=self.namespace
            )
            
            # 更新副本数量
            deployment.spec.replicas = replicas
            
            # 应用更新
            result = self.apps_v1.patch_namespaced_deployment(
                name=name,
                namespace=self.namespace,
                body=deployment
            )
            
            logger.info(f"Scaled deployment {name} to {replicas} replicas")
            return {
                'name': name,
                'replicas': replicas,
                'status': 'scaled'
            }
            
        except ApiException as e:
            logger.error(f"Failed to scale deployment {name}: {e}")
            raise
    
    def get_deployment_status(self, name: str) -> Dict[str, Any]:
        """
        获取Deployment状态
        
        Args:
            name: 部署名称
            
        Returns:
            状态信息
        """
        try:
            deployment = self.apps_v1.read_namespaced_deployment(
                name=name,
                namespace=self.namespace
            )
            
            # 获取Pod状态
            pods = self.v1.list_namespaced_pod(
                namespace=self.namespace,
                label_selector=f"app={name}"
            )
            
            ready_pods = sum(1 for pod in pods.items if pod.status.phase == "Running")
            
            return {
                'name': name,
                'replicas': deployment.spec.replicas,
                'ready_replicas': deployment.status.ready_replicas,
                'available_replicas': deployment.status.available_replicas,
                'updated_replicas': deployment.status.updated_replicas,
                'ready_pods': ready_pods,
                'total_pods': len(pods.items)
            }
            
        except ApiException as e:
            logger.error(f"Failed to get deployment status {name}: {e}")
            return {
                'name': name,
                'error': str(e)
            }
    
    def delete_deployment(self, name: str) -> Dict[str, Any]:
        """
        删除Deployment
        
        Args:
            name: 部署名称
            
        Returns:
            删除结果
        """
        try:
            self.apps_v1.delete_namespaced_deployment(
                name=name,
                namespace=self.namespace
            )
            
            logger.info(f"Deleted deployment: {name}")
            return {
                'name': name,
                'status': 'deleted'
            }
            
        except ApiException as e:
            logger.error(f"Failed to delete deployment {name}: {e}")
            raise

class AgentController:
    """Agent控制器"""
    
    def __init__(self, k8s_manager: KubernetesManager):
        """
        初始化Agent控制器
        
        Args:
            k8s_manager: Kubernetes管理器
        """
        self.k8s_manager = k8s_manager
        self.agents: Dict[str, AgentStatus] = {}
        self.running = False
        self.control_loop_interval = 10  # 控制循环间隔(秒)
    
    def create_agent(self, spec: AgentSpec) -> str:
        """
        创建Agent
        
        Args:
            spec: Agent规格
            
        Returns:
            Agent名称
        """
        try:
            # 创建Kubernetes Deployment
            self.k8s_manager.create_deployment(
                name=spec.name,
                image=spec.image,
                replicas=spec.replicas,
                resources=spec.resources,
                env_vars=spec.env_vars,
                ports=spec.ports
            )
            
            # 创建Agent状态
            self.agents[spec.name] = AgentStatus(
                state=AgentState.PENDING,
                desired_replicas=spec.replicas,
                message="Agent creation initiated"
            )
            
            logger.info(f"Created agent: {spec.name}")
            return spec.name
            
        except Exception as e:
            logger.error(f"Failed to create agent {spec.name}: {e}")
            raise
    
    def scale_agent(self, name: str, replicas: int) -> Dict[str, Any]:
        """
        扩展Agent
        
        Args:
            name: Agent名称
            replicas: 目标副本数量
            
        Returns:
            扩展结果
        """
        if name not in self.agents:
            raise ValueError(f"Agent {name} not found")
        
        try:
            # 扩展Kubernetes Deployment
            result = self.k8s_manager.scale_deployment(name, replicas)
            
            # 更新Agent状态
            self.agents[name].state = AgentState.SCALING
            self.agents[name].desired_replicas = replicas
            self.agents[name].message = f"Scaling to {replicas} replicas"
            self.agents[name].last_updated = time.time()
            
            logger.info(f"Scaled agent {name} to {replicas} replicas")
            return result
            
        except Exception as e:
            logger.error(f"Failed to scale agent {name}: {e}")
            raise
    
    def get_agent_status(self, name: str) -> Optional[AgentStatus]:
        """
        获取Agent状态
        
        Args:
            name: Agent名称
            
        Returns:
            Agent状态
        """
        if name not in self.agents:
            return None
        
        try:
            # 获取Kubernetes状态
            k8s_status = self.k8s_manager.get_deployment_status(name)
            
            # 更新Agent状态
            agent_status = self.agents[name]
            agent_status.current_replicas = k8s_status.get('replicas', 0)
            agent_status.ready_replicas = k8s_status.get('ready_replicas', 0)
            agent_status.last_updated = time.time()
            
            # 确定状态
            if agent_status.current_replicas == 0:
                agent_status.state = AgentState.STOPPED
            elif agent_status.ready_replicas == agent_status.desired_replicas:
                agent_status.state = AgentState.RUNNING
            elif agent_status.ready_replicas < agent_status.desired_replicas:
                agent_status.state = AgentState.SCALING
            else:
                agent_status.state = AgentState.ERROR
            
            return agent_status
            
        except Exception as e:
            logger.error(f"Failed to get agent status {name}: {e}")
            return self.agents[name]
    
    def delete_agent(self, name: str) -> Dict[str, Any]:
        """
        删除Agent
        
        Args:
            name: Agent名称
            
        Returns:
            删除结果
        """
        if name not in self.agents:
            raise ValueError(f"Agent {name} not found")
        
        try:
            # 删除Kubernetes Deployment
            result = self.k8s_manager.delete_deployment(name)
            
            # 移除Agent状态
            del self.agents[name]
            
            logger.info(f"Deleted agent: {name}")
            return result
            
        except Exception as e:
            logger.error(f"Failed to delete agent {name}: {e}")
            raise
    
    async def control_loop(self):
        """控制循环"""
        logger.info("Starting control loop")
        self.running = True
        
        while self.running:
            try:
                # 同步所有Agent状态
                for agent_name in list(self.agents.keys()):
                    self.get_agent_status(agent_name)
                
                # 等待下一个循环
                await asyncio.sleep(self.control_loop_interval)
                
            except Exception as e:
                logger.error(f"Error in control loop: {e}")
                await asyncio.sleep(self.control_loop_interval)
        
        logger.info("Control loop stopped")
    
    def start(self):
        """启动控制器"""
        if not self.running:
            asyncio.create_task(self.control_loop())
            logger.info("Agent controller started")
    
    def stop(self):
        """停止控制器"""
        self.running = False
        logger.info("Agent controller stopped")

class AutoScaler:
    """自动扩展器"""
    
    def __init__(
        self,
        controller: AgentController,
        min_replicas: int = 1,
        max_replicas: int = 10,
        target_cpu_utilization: float = 70.0,
        scale_up_cooldown: int = 60,
        scale_down_cooldown: int = 300
    ):
        """
        初始化自动扩展器
        
        Args:
            controller: Agent控制器
            min_replicas: 最小副本数
            max_replicas: 最大副本数
            target_cpu_utilization: 目标CPU利用率
            scale_up_cooldown: 扩展冷却时间
            scale_down_cooldown: 缩容冷却时间
        """
        self.controller = controller
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.target_cpu_utilization = target_cpu_utilization
        self.scale_up_cooldown = scale_up_cooldown
        self.scale_down_cooldown = scale_down_cooldown
        
        self.last_scale_time: Dict[str, float] = {}
        self.running = False
    
    def _should_scale(self, agent_name: str, direction: str) -> bool:
        """
        判断是否应该扩展
        
        Args:
            agent_name: Agent名称
            direction: 扩展方向
            
        Returns:
            是否应该扩展
        """
        cooldown = self.scale_up_cooldown if direction == "up" else self.scale_down_cooldown
        last_time = self.last_scale_time.get(agent_name, 0)
        
        return (time.time() - last_time) >= cooldown
    
    def _calculate_desired_replicas(
        self,
        current_replicas: int,
        cpu_utilization: float
    ) -> int:
        """
        计算期望副本数
        
        Args:
            current_replicas: 当前副本数
            cpu_utilization: CPU利用率
            
        Returns:
            期望副本数
        """
        if cpu_utilization < self.target_cpu_utilization:
            # 需要缩容
            if cpu_utilization < self.target_cpu_utilization * 0.5:
                # 利用率很低，大幅缩容
                new_replicas = max(
                    self.min_replicas,
                    current_replicas - 2
                )
            else:
                # 适度缩容
                new_replicas = max(
                    self.min_replicas,
                    current_replicas - 1
                )
        else:
            # 需要扩展
            ratio = cpu_utilization / self.target_cpu_utilization
            new_replicas = min(
                self.max_replicas,
                int(current_replicas * ratio)
            )
        
        return new_replicas
    
    def scale_agent(self, agent_name: str, cpu_utilization: float):
        """
        扩展Agent
        
        Args:
            agent_name: Agent名称
            cpu_utilization: CPU利用率
        """
        try:
            # 获取Agent状态
            agent_status = self.controller.get_agent_status(agent_name)
            if not agent_status:
                logger.warning(f"Agent {agent_name} not found")
                return
            
            current_replicas = agent_status.current_replicas
            
            # 计算期望副本数
            desired_replicas = self._calculate_desired_replicas(
                current_replicas,
                cpu_utilization
            )
            
            # 判断是否需要扩展
            if desired_replicas == current_replicas:
                return
            
            # 判断扩展方向
            direction = "up" if desired_replicas > current_replicas else "down"
            
            # 检查冷却时间
            if not self._should_scale(agent_name, direction):
                logger.info(
                    f"Agent {agent_name} is in cooldown period, "
                    f"skipping {direction} scaling"
                )
                return
            
            # 执行扩展
            logger.info(
                f"Scaling agent {agent_name} from {current_replicas} "
                f"to {desired_replicas} replicas (CPU: {cpu_utilization:.1f}%)"
            )
            
            self.controller.scale_agent(agent_name, desired_replicas)
            self.last_scale_time[agent_name] = time.time()
            
        except Exception as e:
            logger.error(f"Failed to scale agent {agent_name}: {e}")
    
    async def auto_scale_loop(self, agent_name: str, check_interval: int = 30):
        """
        自动扩展循环
        
        Args:
            agent_name: Agent名称
            check_interval: 检查间隔
        """
        logger.info(f"Starting auto-scale loop for {agent_name}")
        self.running = True
        
        while self.running:
            try:
                # 获取Agent状态
                agent_status = self.controller.get_agent_status(agent_name)
                if not agent_status:
                    logger.warning(f"Agent {agent_name} not found")
                    await asyncio.sleep(check_interval)
                    continue
                
                # 模拟获取CPU利用率
                # 实际中应该从监控系统获取真实数据
                cpu_utilization = self._simulate_cpu_utilization(agent_status)
                
                # 执行扩展决策
                self.scale_agent(agent_name, cpu_utilization)
                
                # 等待下一个检查周期
                await asyncio.sleep(check_interval)
                
            except Exception as e:
                logger.error(f"Error in auto-scale loop for {agent_name}: {e}")
                await asyncio.sleep(check_interval)
        
        logger.info(f"Auto-scale loop stopped for {agent_name}")
    
    def _simulate_cpu_utilization(self, agent_status: AgentStatus) -> float:
        """
        模拟CPU利用率
        
        Args:
            agent_status: Agent状态
            
        Returns:
            模拟的CPU利用率
        """
        # 这里应该从真实的监控系统获取数据
        # 为演示目的，我们返回一个模拟值
        import random
        
        if agent_status.state == AgentState.RUNNING:
            # 运行状态返回较高的CPU利用率
            return random.uniform(60, 90)
        else:
            # 其他状态返回较低的CPU利用率
            return random.uniform(10, 40)
    
    def start_scaling(self, agent_name: str):
        """启动自动扩展"""
        if not self.running:
            asyncio.create_task(self.auto_scale_loop(agent_name))
            logger.info(f"Auto-scaling started for {agent_name}")

# 使用示例
async def demonstrate_kubernetes_deployment():
    """演示Kubernetes部署的使用"""
    print("=== Kubernetes部署演示 ===\n")
    
    if not KUBERNETES_AVAILABLE:
        print("Kubernetes包不可用，跳过演示")
        print("请安装: pip install kubernetes")
        return
    
    try:
        # 初始化Kubernetes管理器
        k8s_manager = KubernetesManager(namespace="agent-system")
        
        # 初始化Agent控制器
        controller = AgentController(k8s_manager)
        controller.start()
        
        # 创建Agent规格
        agent_spec = AgentSpec(
            name="code-review-agent",
            image="agent-system/code-review:latest",
            replicas=2,
            resources={
                'cpu': '1000m',
                'memory': '1Gi'
            },
            env_vars={
                'MODEL_NAME': 'gpt-4',
                'MAX_CONCURRENT_TASKS': '5'
            },
            ports=[8080]
        )
        
        print("创建Agent:")
        print("-" * 80)
        
        # 创建Agent
        agent_name = controller.create_agent(agent_spec)
        print(f"Agent名称: {agent_name}")
        
        # 等待Agent启动
        await asyncio.sleep(5)
        
        # 获取Agent状态
        print(f"\nAgent状态:")
        print("-" * 80)
        
        for i in range(6):
            status = controller.get_agent_status(agent_name)
            if status:
                print(f"{i+1.} 状态: {status.state.value}")
                print(f"   当前副本: {status.current_replicas}")
                print(f"   就绪副本: {status.ready_replicas}")
                print(f"   期望副本: {status.desired_replicas}")
                print(f"   消息: {status.message}")
            else:
                print(f"{i+1.} Agent未找到")
            
            await asyncio.sleep(10)
        
        # 初始化自动扩展器
        print(f"\n启动自动扩展:")
        print("-" * 80)
        
        auto_scaler = AutoScaler(
            controller=controller,
            min_replicas=1,
            max_replicas=5,
            target_cpu_utilization=70.0
        )
        
        auto_scaler.start_scaling(agent_name)
        
        # 运行一段时间观察自动扩展
        print("运行60秒观察自动扩展行为...")
        await asyncio.sleep(60)
        
        # 获取最终状态
        print(f"\n最终状态:")
        print("-" * 80)
        final_status = controller.get_agent_status(agent_name)
        if final_status:
            print(f"状态: {final_status.state.value}")
            print(f"当前副本: {final_status.current_replicas}")
            print(f"就绪副本: {final_status.ready_replicas}")
        
        # 清理
        print(f"\n清理资源:")
        print("-" * 80)
        controller.delete_agent(agent_name)
        
        controller.stop()
        
    except Exception as e:
        print(f"演示失败: {e}")
        print("请确保Kubernetes集群可用且配置正确")

if __name__ == "__main__":
    if KUBERNETES_AVAILABLE:
        asyncio.run(demonstrate_kubernetes_deployment())
    else:
        print("请安装kubernetes包: pip install kubernetes")

资源调度优化实现

下面是一个智能资源调度优化器的实现：

import time
import heapq
from typing import Dict, List, Optional, Any, Tuple
from dataclasses import dataclass, field
from enum import Enum
import math
import random

class ResourceType(Enum):
    """资源类型"""
    CPU = "cpu"
    MEMORY = "memory"
    GPU = "gpu"
    STORAGE = "storage"

@dataclass
class ResourceRequirement:
    """资源需求"""
    cpu: float = 1.0
    memory: float = 1.0  # GB
    gpu: int = 0
    storage: float = 10.0  # GB
    
    def total_score(self) -> float:
        """计算总资源分数"""
        return self.cpu + self.memory * 0.5 + self.gpu * 4 + self.storage * 0.1

@dataclass(order=True)
class Task:
    """任务"""
    priority: int
    task_id: str
    resource_requirements: ResourceRequirement
    estimated_duration: float = 60.0  # 秒
    arrival_time: float = field(default_factory=time.time)
    deadline: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Node:
    """计算节点"""
    node_id: str
    cpu_capacity: float = 4.0
    memory_capacity: float = 16.0  # GB
    gpu_capacity: int = 0
    storage_capacity: float = 100.0  # GB
    
    cpu_used: float = 0.0
    memory_used: float = 0.0
    gpu_used: int = 0
    storage_used: float = 0.0
    
    tasks: List[Task] = field(default_factory=list)
    
    def cpu_available(self) -> float:
        """可用的CPU"""
        return self.cpu_capacity - self.cpu_used
    
    def memory_available(self) -> float:
        """可用的内存"""
        return self.memory_capacity - self.memory_used
    
    def gpu_available(self) -> int:
        """可用的GPU"""
        return self.gpu_capacity - self.gpu_used
    
    def storage_available(self) -> float:
        """可用的存储"""
        return self.storage_capacity - self.storage_used
    
    def cpu_utilization(self) -> float:
        """CPU利用率"""
        return self.cpu_used / self.cpu_capacity if self.cpu_capacity > 0 else 0
    
    def memory_utilization(self) -> float:
        """内存利用率"""
        return self.memory_used / self.memory_capacity if self.memory_capacity > 0 else 0
    
    def can_fit_task(self, task: Task) -> bool:
        """判断是否可以容纳任务"""
        req = task.resource_requirements
        return (
            self.cpu_available() >= req.cpu and
            self.memory_available() >= req.memory and
            self.gpu_available() >= req.gpu and
            self.storage_available() >= req.storage
        )
    
    def allocate_task(self, task: Task):
        """分配任务"""
        req = task.resource_requirements
        self.cpu_used += req.cpu
        self.memory_used += req.memory
        self.gpu_used += req.gpu
        self.storage_used += req.storage
        self.tasks.append(task)
    
    def deallocate_task(self, task: Task):
        """释放任务"""
        req = task.resource_requirements
        self.cpu_used -= req.cpu
        self.memory_used -= req.memory
        self.gpu_used -= req.gpu
        self.storage_used -= req.storage
        
        if task in self.tasks:
            self.tasks.remove(task)

class SchedulingStrategy(Enum):
    """调度策略"""
    FIRST_FIT = "first_fit"           # 首次适应
    BEST_FIT = "best_fit"             # 最佳适应
    WORST_FIT = "worst_fit"           # 最差适应
    ROUND_ROBIN = "round_robin"       # 轮询
    PRIORITY_BASED = "priority"       # 基于优先级
    LOAD_BALANCED = "load_balanced"   # 负载均衡
    ENERGY_AWARE = "energy_aware"     # 能效感知

class ResourceScheduler:
    """资源调度器"""
    
    def __init__(
        self,
        nodes: List[Node],
        strategy: SchedulingStrategy = SchedulingStrategy.BEST_FIT
    ):
        """
        初始化资源调度器
        
        Args:
            nodes: 计算节点列表
            strategy: 调度策略
        """
        self.nodes = nodes
        self.strategy = strategy
        self.task_queue: List[Task] = []
        self.scheduled_tasks: Dict[str, Tuple[Task, Node]] = {}
        self.completed_tasks: List[Task] = []
        self.current_round_index = 0
        
        self.metrics = {
            'total_scheduled': 0,
            'total_completed': 0,
            'total_failed': 0,
            'avg_scheduling_time': 0.0,
            'avg_completion_time': 0.0,
            'node_utilization': {}
        }
    
    def submit_task(self, task: Task):
        """
        提交任务
        
        Args:
            task: 任务对象
        """
        heapq.heappush(self.task_queue, task)
        logger.info(f"Task {task.task_id} submitted")
    
    def _find_best_node(self, task: Task) -> Optional[Node]:
        """
        根据策略找到最佳节点
        
        Args:
            task: 任务对象
            
        Returns:
            最佳节点或None
        """
        suitable_nodes = [
            node for node in self.nodes
            if node.can_fit_task(task)
        ]
        
        if not suitable_nodes:
            return None
        
        if self.strategy == SchedulingStrategy.FIRST_FIT:
            # 首次适应：第一个合适的节点
            return suitable_nodes[0]
        
        elif self.strategy == SchedulingStrategy.BEST_FIT:
            # 最佳适应：剩余资源最少的节点
            return min(
                suitable_nodes,
                key=lambda n: (
                    n.cpu_available() +
                    n.memory_available() * 0.5 +
                    n.gpu_available() * 4
                )
            )
        
        elif self.strategy == SchedulingStrategy.WORST_FIT:
            # 最差适应：剩余资源最多的节点
            return max(
                suitable_nodes,
                key=lambda n: (
                    n.cpu_available() +
                    n.memory_available() * 0.5 +
                    n.gpu_available() * 4
                )
            )
        
        elif self.strategy == SchedulingStrategy.ROUND_ROBIN:
            # 轮询：按顺序选择节点
            selected_node = suitable_nodes[self.current_round_index % len(suitable_nodes)]
            self.current_round_index += 1
            return selected_node
        
        elif self.strategy == SchedulingStrategy.PRIORITY_BASED:
            # 基于优先级：选择最适合高优先级任务的节点
            return min(
                suitable_nodes,
                key=lambda n: abs(n.cpu_utilization() - 0.7) + 
                           abs(n.memory_utilization() - 0.7) * 0.5
            )
        
        elif self.strategy == SchedulingStrategy.LOAD_BALANCED:
            # 负载均衡：选择利用率最低的节点
            return min(
                suitable_nodes,
                key=lambda n: (n.cpu_utilization() + n.memory_utilization()) / 2
            )
        
        elif self.strategy == SchedulingStrategy.ENERGY_AWARE:
            # 能效感知：优先选择已经高利用率的节点
            return max(
                suitable_nodes,
                key=lambda n: (n.cpu_utilization() + n.memory_utilization()) / 2
            )
        
        else:
            return suitable_nodes[0]
    
    def schedule_task(self, task: Task) -> bool:
        """
        调度任务
        
        Args:
            task: 任务对象
            
        Returns:
            是否调度成功
        """
        start_time = time.time()
        
        # 查找最佳节点
        node = self._find_best_node(task)
        
        if node is None:
            logger.warning(f"No suitable node found for task {task.task_id}")
            return False
        
        # 分配任务到节点
        node.allocate_task(task)
        self.scheduled_tasks[task.task_id] = (task, node)
        
        # 更新统计
        scheduling_time = time.time() - start_time
        self.metrics['total_scheduled'] += 1
        self.metrics['avg_scheduling_time'] = (
            (self.metrics['avg_scheduling_time'] * (self.metrics['total_scheduled'] - 1) + 
             scheduling_time) / self.metrics['total_scheduled']
        )
        
        logger.info(
            f"Task {task.task_id} scheduled to node {node.node_id} "
            f"(strategy: {self.strategy.value})"
        )
        
        return True
    
    def schedule_all_pending(self) -> int:
        """
        调度所有待处理任务
        
        Returns:
            成功调度的任务数量
        """
        scheduled_count = 0
        
        while self.task_queue:
            task = heapq.heappop(self.task_queue)
            if self.schedule_task(task):
                scheduled_count += 1
            else:
                # 放回队列
                heapq.heappush(self.task_queue, task)
                break
        
        return scheduled_count
    
    def complete_task(self, task_id: str):
        """
        完成任务
        
        Args:
            task_id: 任务ID
        """
        if task_id not in self.scheduled_tasks:
            logger.warning(f"Task {task_id} not found in scheduled tasks")
            return
        
        task, node = self.scheduled_tasks[task_id]
        
        # 释放资源
        node.deallocate_task(task)
        
        # 移动到完成列表
        self.completed_tasks.append(task)
        del self.scheduled_tasks[task_id]
        
        # 更新统计
        self.metrics['total_completed'] += 1
        completion_time = time.time() - task.arrival_time
        self.metrics['avg_completion_time'] = (
            (self.metrics['avg_completion_time'] * (self.metrics['total_completed'] - 1) + 
             completion_time) / self.metrics['total_completed']
        )
        
        logger.info(f"Task {task_id} completed on node {node.node_id}")
    
    def get_node_utilization(self) -> Dict[str, Dict[str, float]]:
        """
        获取节点利用率
        
        Returns:
            节点利用率字典
        """
        return {
            node.node_id: {
                'cpu_utilization': node.cpu_utilization(),
                'memory_utilization': node.memory_utilization(),
                'gpu_utilization': node.gpu_used / node.gpu_capacity if node.gpu_capacity > 0 else 0,
                'task_count': len(node.tasks)
            }
            for node in self.nodes
        }
    
    def get_cluster_utilization(self) -> Dict[str, float]:
        """
        获取集群利用率
        
        Returns:
            集群利用率字典
        """
        total_cpu_capacity = sum(node.cpu_capacity for node in self.nodes)
        total_cpu_used = sum(node.cpu_used for node in self.nodes)
        
        total_memory_capacity = sum(node.memory_capacity for node in self.nodes)
        total_memory_used = sum(node.memory_used for node in self.nodes)
        
        total_gpu_capacity = sum(node.gpu_capacity for node in self.nodes)
        total_gpu_used = sum(node.gpu_used for node in self.nodes)
        
        return {
            'cpu_utilization': total_cpu_used / total_cpu_capacity if total_cpu_capacity > 0 else 0,
            'memory_utilization': total_memory_used / total_memory_capacity if total_memory_capacity > 0 else 0,
            'gpu_utilization': total_gpu_used / total_gpu_capacity if total_gpu_capacity > 0 else 0,
            'total_tasks': len(self.scheduled_tasks),
            'pending_tasks': len(self.task_queue)
        }
    
    def get_metrics(self) -> Dict[str, Any]:
        """
        获取调度器指标
        
        Returns:
            指标字典
        """
        self.metrics['node_utilization'] = self.get_node_utilization()
        self.metrics['cluster_utilization'] = self.get_cluster_utilization()
        
        return self.metrics
    
    def simulate_task_execution(self, task_id: str):
        """
        模拟任务执行
        
        Args:
            task_id: 任务ID
        """
        if task_id not in self.scheduled_tasks:
            return
        
        task, _ = self.scheduled_tasks[task_id]
        
        # 模拟执行时间
        import random
        actual_duration = task.estimated_duration * random.uniform(0.8, 1.2)
        
        # 模拟异步执行
        import threading
        def complete():
            time.sleep(actual_duration)
            self.complete_task(task_id)
        
        thread = threading.Thread(target=complete)
        thread.daemon = True
        thread.start()

# 演示资源调度
def demonstrate_resource_scheduling():
    """演示资源调度的使用"""
    print("=== 资源调度优化演示 ===\n")
    
    # 创建计算节点
    nodes = [
        Node(
            node_id="node-1",
            cpu_capacity=8.0,
            memory_capacity=32.0,
            gpu_capacity=2,
            storage_capacity=200.0
        ),
        Node(
            node_id="node-2",
            cpu_capacity=4.0,
            memory_capacity=16.0,
            gpu_capacity=0,
            storage_capacity=100.0
        ),
        Node(
            node_id="node-3",
            cpu_capacity=16.0,
            memory_capacity=64.0,
            gpu_capacity=4,
            storage_capacity=500.0
        ),
        Node(
            node_id="node-4",
            cpu_capacity=8.0,
            memory_capacity=32.0,
            gpu_capacity=2,
            storage_capacity=200.0
        )
    ]
    
    # 创建调度器
    scheduler = ResourceScheduler(nodes, strategy=SchedulingStrategy.BEST_FIT)
    
    print("创建测试任务:")
    print("-" * 80)
    
    # 创建各种类型的任务
    tasks = [
        Task(
            priority=1,
            task_id="task-1",
            resource_requirements=ResourceRequirement(cpu=2.0, memory=8.0, gpu=0),
            estimated_duration=30.0
        ),
        Task(
            priority=1,
            task_id="task-2",
            resource_requirements=ResourceRequirement(cpu=1.0, memory=4.0, gpu=0),
            estimated_duration=45.0
        ),
        Task(
            priority=2,
            task_id="task-3",
            resource_requirements=ResourceRequirement(cpu=4.0, memory=16.0, gpu=1),
            estimated_duration=60.0
        ),
        Task(
            priority=1,
            task_id="task-4",
            resource_requirements=ResourceRequirement(cpu=2.0, memory=8.0, gpu=0),
            estimated_duration=30.0
        ),
        Task(
            priority=3,
            task_id="task-5",
            resource_requirements=ResourceRequirement(cpu=8.0, memory=32.0, gpu=2),
            estimated_duration=90.0
        ),
        Task(
            priority=1,
            task_id="task-6",
            resource_requirements=ResourceRequirement(cpu=1.0, memory=4.0, gpu=0),
            estimated_duration=20.0
        ),
        Task(
            priority=2,
            task_id="task-7",
            resource_requirements=ResourceRequirement(cpu=2.0, memory=8.0, gpu=0),
            estimated_duration=35.0
        ),
        Task(
            priority=1,
            task_id="task-8",
            resource_requirements=ResourceRequirement(cpu=3.0, memory=12.0, gpu=1),
            estimated_duration=50.0
        )
    ]
    
    # 提交任务
    for task in tasks:
        scheduler.submit_task(task)
        print(f"提交任务: {task.task_id}, 优先级: {task.priority}, "
              f"CPU: {task.resource_requirements.cpu}, 内存: {task.resource_requirements.memory}, "
              f"GPU: {task.resource_requirements.gpu}")
    
    # 调度所有任务
    print(f"\n调度任务 (策略: {scheduler.strategy.value}):")
    print("-" * 80)
    
    scheduled_count = scheduler.schedule_all_pending()
    print(f"成功调度 {scheduled_count} 个任务")
    
    # 显示调度结果
    print(f"\n节点利用率:")
    print("-" * 80)
    
    node_utilization = scheduler.get_node_utilization()
    for node_id, util in node_utilization.items():
        print(f"{node_id}:")
        print(f"  CPU利用率: {util['cpu_utilization']:.2%}")
        print(f"  内存利用率: {util['memory_utilization']:.2%}")
        print(f"  GPU利用率: {util['gpu_utilization']:.2%}")
        print(f"  运行任务数: {util['task_count']}")
    
    # 显示集群利用率
    print(f"\n集群总体利用率:")
    print("-" * 80)
    
    cluster_util = scheduler.get_cluster_utilization()
    print(f"总CPU利用率: {cluster_util['cpu_utilization']:.2%}")
    print(f"总内存利用率: {cluster_util['memory_utilization']:.2%}")
    print(f"总GPU利用率: {cluster_util['gpu_utilization']:.2%}")
    print(f"运行中任务: {cluster_util['total_tasks']}")
    print(f"待处理任务: {cluster_util['pending_tasks']}")
    
    # 比较不同调度策略
    print(f"\n比较不同调度策略:")
    print("-" * 80)
    
    strategies = [
        SchedulingStrategy.FIRST_FIT,
        SchedulingStrategy.BEST_FIT,
        SchedulingStrategy.WORST_FIT,
        SchedulingStrategy.LOAD_BALANCED
    ]
    
    for strategy in strategies:
        # 创建新的调度器
        test_scheduler = ResourceScheduler(nodes.copy(), strategy=strategy)
        
        # 重新提交任务
        for task in tasks:
            test_scheduler.submit_task(task)
        
        # 调度任务
        test_scheduler.schedule_all_pending()
        
        # 获取集群利用率
        cluster_util = test_scheduler.get_cluster_utilization()
        
        print(f"{strategy.value}:")
        print(f"  CPU利用率: {cluster_util['cpu_utilization']:.2%}")
        print(f"  内存利用率: {cluster_util['memory_utilization']:.2%}")
        print(f"  运行中任务: {cluster_util['total_tasks']}")
        print(f"  待处理任务: {cluster_util['pending_tasks']}")
    
    # 显示详细指标
    print(f"\n详细调度指标:")
    print("-" * 80)
    
    metrics = scheduler.get_metrics()
    print(f"总调度任务数: {metrics['total_scheduled']}")
    print(f"总完成任务数: {metrics['total_completed']}")
    print(f"平均调度时间: {metrics['avg_scheduling_time']:.4f}秒")
    print(f"平均完成时间: {metrics['avg_completion_time']:.2f}秒")

if __name__ == "__main__":
    demonstrate_resource_scheduling()

最佳实践与常见陷阱

大规模部署最佳实践

基础设施即代码：使用基础设施即代码(IaC)管理部署环境，确保部署的一致性和可重复性。常用的IaC工具包括Terraform、Ansible、CloudFormation等。IaC提供版本控制、自动化部署、环境一致性等优势。

蓝绿部署：采用蓝绿部署策略，减少部署风险。维护两个相同的生产环境(蓝和绿)，新版本部署到非活跃环境，验证后切换流量。蓝绿部署提供快速回滚能力和零停机部署。

金丝雀发布：使用金丝雀发布策略，渐进式地推出新版本。先将新版本部署到少量实例，监控表现后再逐步扩大范围。金丝雀发布降低风险，提供问题早期发现能力。

配置管理：实现统一的配置管理，避免配置漂移。使用配置中心或环境变量管理配置，实现配置的集中控制和版本管理。配置管理确保环境一致性和配置可追溯性。

资源管理最佳实践

资源预留：为关键服务预留足够的资源，确保服务质量。资源预留应该基于历史数据和预测模型，考虑峰值需求和增长趋势。预留策略平衡服务质量和成本效益。

弹性伸缩：实现自动弹性伸缩，根据负载动态调整资源。弹性伸缩包括水平扩展和垂直扩展，应该根据服务特性选择合适的策略。自动伸缩提升资源利用率和响应能力。

成本优化：实施成本优化策略，降低资源使用成本。优化策略包括：使用spot实例、合理选择实例类型、优化资源利用率、实施资源清理等。成本优化在保证服务质量的前提下降低运营成本。

容量规划：进行前瞻性容量规划，确保资源供应满足业务增长。容量规划应该基于历史数据、业务预测、技术趋势等多个因素。规划结果应该定期review和调整。

监控和运维最佳实践

全链路监控：实现端到端的监控覆盖，及时发现和定位问题。监控应该包括基础设施、应用服务、业务指标等多个层面。全链路监控提供完整的可观测性。

自动化运维：提高运维自动化程度，减少人工干预。自动化运维包括自动部署、自动扩展、自动恢复、自动告警等。自动化提升运维效率，降低人为错误。

灾难恢复：建立完善的灾难恢复机制，确保业务连续性。灾难恢复包括数据备份、故障切换、异地容灾等。恢复机制应该定期测试和验证。

安全合规：确保部署过程和运行环境符合安全合规要求。安全措施包括网络安全、数据加密、访问控制、审计日志等。合规管理确保满足监管要求。

常见陷阱及解决方案

资源浪费：过度配置导致资源浪费和成本过高。解决方案包括：实施资源监控、使用自动扩展、选择合适的实例类型、定期清理未使用资源。

单点故障：关键组件存在单点故障风险。解决方案包括：实现高可用架构、使用负载均衡、部署多实例、实施故障转移。

配置漂移：环境配置不一致导致问题。解决方案包括：使用基础设施即代码、实施配置管理、建立环境同步机制、定期配置审计。

扩展性限制：架构设计限制了系统的扩展能力。解决方案包括：采用无状态设计、实现数据分片、使用缓存策略、优化数据库设计。

性能优化考虑

部署性能优化

优化部署过程可以显著提升系统性能和用户体验：

容器镜像优化：优化容器镜像大小和构建时间。使用多阶段构建、选择基础镜像、清理不必要文件、利用镜像缓存等策略。镜像优化减少部署时间和资源消耗。

启动优化：优化应用启动时间，提升扩展效率。启动优化包括：延迟加载、并发初始化、资源预分配、连接池预热等。快速启动提升自动扩展的效果。

网络优化：优化网络配置和通信效率。网络优化包括：使用内网通信、启用连接复用、实施CDN加速、优化DNS解析等。网络优化降低延迟，提升性能。

存储优化：优化存储性能和成本。存储优化包括：选择合适的存储类型、实施缓存策略、优化数据布局、使用压缩算法等。存储优化平衡性能和成本。

资源利用优化

提升资源利用率可以显著降低运营成本：

资源共享：实现合理的资源共享机制。资源共享包括：CPU超卖、内存复用、存储共享等。资源共享提升整体利用率，但需要保证服务质量。

负载均衡：实现智能的负载均衡策略。负载均衡算法包括：轮询、最少连接、一致性哈希、地理位置等。均衡算法优化资源分配和系统性能。

缓存策略：实施多层缓存策略减少资源消耗。缓存包括：内存缓存、分布式缓存、CDN缓存等。缓存策略降低后端负载，提升响应速度。

资源回收：及时回收不再使用的资源。资源回收包括：垃圾回收、连接清理、缓存清理、实例终止等。及时回收避免资源浪费。

参考资源

官方文档

Kubernetes Documentation: https://kubernetes.io/docs/ - Kubernetes容器编排平台的官方文档
Docker Documentation: https://docs.docker.com/ - Docker容器技术的官方文档
Helm Documentation: https://helm.sh/docs/ - Kubernetes包管理器Helm的官方文档

技术论文和文章

"Kubernetes: Container Orchestration at Scale": Kubernetes大规模容器编排的技术论文
"The Design and Implementation of a Log-Structured File System": 关于文件系统设计的研究
"Google Borg: Large-scale Cluster Management at Google": Google Borg大规模集群管理的技术论文

开源工具和库

Kubernetes: https://kubernetes.io/ - 容器编排平台
Helm: https://helm.sh/ - Kubernetes包管理器
Prometheus: https://prometheus.io/ - 监控系统
Grafana: https://grafana.com/ - 可视化平台
Terraform: https://www.terraform.io/ - 基础设施即代码工具

实战案例

Large-Scale Kubernetes Deployment: 大规模Kubernetes部署的实战案例
Microservices at Scale: 大规模微服务架构的实践经验
Cloud-Native Architecture: 云原生架构的设计原则和最佳实践

通过本文的学习，读者应该能够掌握大规模Agent系统部署和资源管理的核心技术，能够在实际项目中构建高效、经济、可靠的大规模Agent系统。至此，高级Agent应用系列的第二阶段"Agent性能优化"部分已经完成，读者应该具备了构建高性能、高可用、可扩展的Agent系统的完整技术能力。