

Efficient Vision Transformer Inference via UDP for Edge-Cloud Collaboration: An Adaptive Loss Detection Approach

  • Structured abstract:
    Background Vision transformers (ViTs) perform well on computer vision tasks but face severe computational challenges when deployed on edge devices. As demand for high-performance, large-scale transformer models continues to grow, accelerating ViT inference in edge-cloud environments while coping with network packet loss has become a pressing research problem.
    Objective This study proposes an edge-cloud collaborative inference framework that uses User Datagram Protocol (UDP) transmission to accelerate ViT inference. It addresses UDP's inherent unreliability so that inference speed improves while the structural integrity and performance stability of pretrained ViT models are preserved.
    Methods We design the efficient vision transformer inference framework (EViTIF), which adopts an edge-cloud collaborative architecture: the ViT model is strategically partitioned and assigned to edge and cloud nodes, with low-latency communication carried over UDP. To mitigate UDP's inherent unreliability, the framework integrates a packet error rate adaptive loss detection network (PALDN), which handles data corruption caused by packet loss and dynamically recovers the lost data without extensive model retraining. Experiments were conducted on an NVIDIA Jetson Xavier NX edge device and a cloud server equipped with an A100 GPU to validate the framework's performance.
    Results Experimental results show that EViTIF significantly accelerates inference while preserving the structure and performance of pretrained ViT models. Compared with traditional Transmission Control Protocol (TCP)-based methods under similar packet error conditions, EViTIF achieves up to a 57x speedup; even at packet error rates as high as 60%, PALDN successfully recovers corrupted data with accuracy loss kept within 2%. EViTIF also generalizes well across multiple ViT variants and scales effectively to large datasets such as ImageNet.
    Conclusions By balancing computational efficiency with robustness to network imperfections, EViTIF offers a practical solution for real-time, high-performance vision applications in edge computing. Future work will extend EViTIF to multimodal tasks (including collaborative inference for vision and 3D rendering applications) to further improve inference speed and broaden the framework's applicability, and will explore integrating it with model pruning and distillation to further reduce system overhead while retaining resilience to lossy communication.

     

    Abstract: Vision transformers (ViTs) deliver exceptional performance in computer vision tasks but pose significant computational challenges for edge devices. We present the efficient vision transformer inference framework (EViTIF), an edge-cloud collaborative framework that uses User Datagram Protocol (UDP) to achieve low-latency communication by strategically partitioning ViT models between edge and cloud environments. To mitigate UDP's inherent unreliability, we introduce the packet error rate adaptive loss detection network (PALDN), which dynamically recovers lost data without requiring extensive model retraining. Our experiments, conducted on an NVIDIA Jetson Xavier NX edge device and an A100 GPU-equipped cloud server, demonstrate that EViTIF accelerates inference by up to 57x compared with traditional Transmission Control Protocol (TCP)-based methods. Even with up to 60% packet loss, PALDN keeps accuracy degradation below 2%, outperforming existing super-resolution-based recovery approaches. Moreover, EViTIF demonstrates its versatility by generalizing across different ViT variants and scaling effectively to larger datasets such as ImageNet. This framework enables real-time, high-performance vision applications in edge computing by balancing computational efficiency with robustness against network imperfections.
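The abstract does not detail PALDN's architecture, but the UDP-side mechanics it builds on can be sketched: the edge node serializes an intermediate activation tensor into sequence-numbered datagrams, and the receiver zero-fills any chunks whose packets were dropped and flags them for the recovery network to repair. The following minimal sketch illustrates that packetize/reassemble step only; all names and the 1024-byte payload size are assumptions, not the paper's implementation.

```python
import struct
import random

MTU_PAYLOAD = 1024  # bytes of tensor data per datagram (assumed size)

def packetize(tensor_bytes: bytes) -> list[bytes]:
    """Split a serialized activation tensor into UDP-sized datagrams.
    Each datagram carries a 4-byte big-endian sequence number so the
    receiver can detect which chunks were lost."""
    packets = []
    for seq, off in enumerate(range(0, len(tensor_bytes), MTU_PAYLOAD)):
        chunk = tensor_bytes[off:off + MTU_PAYLOAD]
        packets.append(struct.pack("!I", seq) + chunk)
    return packets

def reassemble(received: list[bytes], total_len: int) -> tuple[bytearray, list[int]]:
    """Rebuild the tensor buffer, zero-filling chunks whose packets were
    dropped. Returns the buffer plus the indices of lost chunks, which a
    loss-recovery network (PALDN's role in the paper) would then repair."""
    buf = bytearray(total_len)  # zero-initialized: gaps stay as zeros
    seen = set()
    for pkt in received:
        seq = struct.unpack("!I", pkt[:4])[0]
        chunk = pkt[4:]
        buf[seq * MTU_PAYLOAD: seq * MTU_PAYLOAD + len(chunk)] = chunk
        seen.add(seq)
    n_chunks = (total_len + MTU_PAYLOAD - 1) // MTU_PAYLOAD
    lost = [s for s in range(n_chunks) if s not in seen]
    return buf, lost

# Simulate 60% packet loss, the worst case reported in the abstract.
data = bytes(range(256)) * 40  # 10240-byte mock activation tensor
pkts = packetize(data)
rng = random.Random(0)
delivered = [p for p in pkts if rng.random() > 0.6]
buf, lost = reassemble(delivered, len(data))
print(f"{len(lost)}/{len(pkts)} chunks lost; buffer length preserved: {len(buf) == len(data)}")
```

In contrast to TCP, nothing here retransmits or blocks on missing data, which is where the reported latency advantage comes from: the lost-chunk indices are simply handed to the recovery model instead of stalling the pipeline.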

     
