Yiqiao Qiu 邱奕乔
Computer Vision / ML Engineer • Distributed Systems SDE
Currently Software Engineer at AWS (Datacenter Network Infra — Scalable Intent-Driven Routing).
Previously Computer Vision Engineer Intern at XPeng Motors, ByteDance, and DMAI, and
SDE Intern at Amazon CloudFront. UCSD MS CSE, GPA 3.93.
7 publications · 112 citations.
01 About
My work threads together two layers of the modern AI stack. On the model layer:
computer-vision and multi-modal-LLM algorithm work across the full industrial lifecycle — model
optimization and deployment, research publications, and ML-infrastructure implementation.
Underneath: the massive-scale distributed networking and routing control plane that runs
AI datacenter clusters — the infrastructure powering every training and inference workload in the
modern AI ecosystem.
02 Overview
Story I
Computer Vision & Multi-Modal
Industrial CV model optimization at XPeng Motors and ByteDance — from training-pipeline acceleration to TikTok live-streaming super-resolution; 7 research papers on continual learning, distillation, and OOD detection; current work on VLM distillation + token compression, Pipeline Parallelism on torchtitan, and Agentic RAG.
Read the full story →
Story II
Distributed Systems
SIDR — AWS's Scalable Intent-Driven Routing protocol, the control plane for datacenter fabrics spanning 7,000+ switch nodes. Feature development, end-to-end ownership of release qualification, production operation, and debugging MPC race conditions under real failures.
Read the full story →
Chronology
Experience
Currently Software Engineer at AWS (SIDR); previously Computer Vision Engineer Intern at XPeng Motors, ByteDance, and DMAI, and SDE Intern at Amazon CloudFront.
See timeline →
Publications
Research
7 publications · 112 citations spanning semantic segmentation, knowledge-distillation-based continual learning, classification, OOD detection, and transfer learning. 1st-author SATS paper in Pattern Recognition.
See publications →
03 Skills
Programming Languages
Python · Rust · C / C++ · Java · Kotlin · shell · SQL
Machine Learning / Deep Learning
Efficient Industrial Model Optimization · Continual Learning · Model Distillation · LoRA fine-tuning · Transfer Learning · Supervised / Semi-Supervised
Computer Vision
Semantic Segmentation · Classification · Object Detection · Super-Resolution · Facial Landmark Detection · Scene Understanding · VQA · Anomaly Detection
Distributed Systems
Large-scale distributed systems · CAP theorem · multi-phase commit · protocol fault tolerance · inter-node communication-cost optimization
Networking
SDN · BGP · OSPF · Quagga zebra · IPv4 / IPv6 · TCP / UDP · Rust Netlink
OS / System
Linux Kernel · Rust Tokio async · gRPC · Docker
ML Infrastructure & Deployment
PyTorch · torchtitan · Pipeline Parallelism · Fully Sharded Data Parallelism (FSDP) · ONNX · NVIDIA DALI · llama.cpp · GGUF quantization
Cloud Services
AWS (DynamoDB · S3 · CloudWatch · CloudFront · CloudFormation)
04 Education
University of California, San Diego
Sep 2022 – Mar 2024
M.S. in Computer Science and Engineering · GPA 3.93 / 4.0
Sun Yat-sen University
Sep 2018 – Jun 2022
B.Eng. in Computer Science · Major GPA 3.94 / 4.0 (top 10%) · Overall GPA 3.8 / 4.0
05 Get in touch
Open to CV / ML Engineer and Distributed Systems SDE roles.
Best reached by email — yiqiaoqiu@hotmail.com.
My ML work tells one story: take the efficient-model mindset — latency, memory, throughput —
from classical CV into modern VLMs and the training systems beneath them.
Industrial efficient-model work
Across three internships I shipped production CV models covering the core task families —
semantic segmentation, object detection, super-resolution, facial-landmark detection, image
classification, and video classification — under strict latency / compute budgets.
-
XPeng Motors — Autonomous Driving Center.
Python, PyTorch, NVIDIA DALI, ONNX · Oct 2023 – Mar 2024.
-
Training-pipeline acceleration for large-scale perception models. Accelerated on-car
perception training by integrating NVIDIA DALI for GPU-based online augmentation
on large-scale image datasets, offloading preprocessing from CPU to GPU via multi-process pipelines —
7× training speedup and 80% CPU reduction, unblocking faster iteration across
perception teams.
-
Multi-task backbone consolidation for on-car deployment. Merged multiple task-specific
perception models into a unified shared backbone; systematically explored trade-offs across
architectures, FLOPs, and cross-task generalization. Reduced on-car model scheduling and memory
overhead while preserving per-task accuracy, improving deployment efficiency on resource-constrained
automotive compute.
-
Eye-action video classification for the Driver Monitoring System (DMS). Owned end-to-end
development of the eye-action recognition pipeline for in-cabin fatigue and distraction detection —
dataset construction, temporal model design and training, and in-vehicle real-scene validation under
varied lighting and head poses. 99.64% binary classification accuracy with 30%
inference-latency reduction, meeting the real-time on-car constraint.
-
Simulation-driven data augmentation for long-tail object detection. Replenished an
object-detection dataset with photorealistic simulation data for rare / long-tail categories;
validated the pipeline by training YOLO-X on the augmented dataset and demonstrating consistent
mAP gains on underrepresented classes. Co-author of Anything in Any Scene.
-
ByteDance — Real-Time Communication, Video Group.
Python, PyTorch · Nov 2021 – Apr 2022.
-
Real-time multi-frame super-resolution for live-streaming codec enhancement. Proposed novel
auxiliary modules at the low-level encoder / decoder of a real-time multi-frame super-resolution
model, leveraging temporal consistency across adjacent frames and
residual-aware feature fusion to alleviate video-decoding blocky artifacts from
aggressive compression — without inflating inference latency. 43% PSNR improvement
in offline testing; deployed into the TikTok live-streaming RTC video engine.
-
Robust facial-landmark detection for ROI-aware bitrate allocation. Optimized the landmark
model driving ROI-based bitrate allocation on streamers' faces via (1) facial-parsing
preprocessing for semantic priors, (2) weighted loss with balanced resampling
for long-tail poses / occlusions, and (3) an auxiliary global-context branch
stabilizing predictions under occlusion and motion blur. 67% NME reduction;
unstructured pruning further cut inference time by 20% to fit the RTC latency budget.
-
Training infrastructure. Built FFmpeg-based offline augmentation simulating
codec artifacts, bitrate ladders, and frame-drop patterns, plus a multi-threaded concurrent
I/O queue hiding I/O latency behind GPU compute — 40% training-time reduction
on large-scale video datasets.
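The queue idea above can be sketched in plain Python (a toy stand-in for the actual FFmpeg/GPU pipeline; `load_batch` and the buffer depth here are illustrative):

```python
import queue
import threading

def prefetch_batches(load_batch, num_batches, depth=4):
    """Yield batches while a background thread keeps loading ahead.

    load_batch: callable index -> batch (stands in for disk / decode I/O).
    depth: max batches buffered ahead of the consumer.
    """
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))   # blocks when the buffer is full
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        yield item

# Usage: training consumes batches while the next ones load in the background.
batches = list(prefetch_batches(lambda i: [i] * 2, num_batches=3))
```

The GPU compute step would sit inside the consumer loop, so loading the next batch overlaps with training on the current one.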
-
DMAI — Research Center.
Python, PyTorch · Jul 2021 – Oct 2021.
Benchmarked and optimized lightweight object detectors (RFB, YOLO-X, YOLO-v5) with
data-augmentation search and loss tuning to suppress false positives (99.5% mAP);
improved classification with open-set loss to resolve 95% of edge-case
failures at 99% precision.
VLM — fine-tuning, token compression, on-device deployment
Distilling Qwen2.5-VL 32B → 3B with LoRA rank-16 on DriveLM-nuScenes; integrated four
visual-token compression methods (FasterVLM, PruMerge, PyramidDrop, and my own SATS-CRP
— region-aware self-attention transfer). 4× token reduction (480 → 120) with DriveLM LoRA accuracy
+1%; 16× extreme compression at only 2.4% degradation; BF16 → GGUF Q4_K_M quantization
(3.9× smaller) deployed on a consumer RTX 4070 Ti via llama.cpp at
170 tokens/s, 142 ms TTFT.
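A minimal illustration of attention-score token pruning in the spirit of these methods (pure Python; the scoring rule and `keep` budget are simplified assumptions, not the SATS-CRP implementation):

```python
def prune_visual_tokens(tokens, cls_attention, keep):
    """Keep the `keep` visual tokens with the highest [CLS]-attention score.

    tokens: list of token embeddings (any objects).
    cls_attention: one relevance score per token (e.g. averaged over heads).
    Returns surviving tokens in their original positional order.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: cls_attention[i], reverse=True)
    keep_idx = sorted(ranked[:keep])          # restore positional order
    return [tokens[i] for i in keep_idx]

# A 480 -> 120 reduction corresponds to keep = len(tokens) // 4 in this scheme.
kept = prune_visual_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.2], keep=2)
```

Restoring positional order after selection matters because the language model still consumes the surviving tokens as an ordered sequence.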
ML Infra — Pipeline Parallelism on torchtitan
My fork of PyTorch torchtitan
(branch 309b462)
implements a Block Attention Residuals experiment composed with multi-stage
pipeline parallelism (PP=4 + cache adapter smoke-green), an AttnRes subclass with a scaling-law
config registry, silent-grad-loss detection with clone-on-capture, a grouped_mm +
torch.compile throughput path, and a LLaVA-style multimodal scaffolding
on top of a Kimi Linear (KDA / MLA / MLP) base.
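For context, a fill-drain (GPipe-style) forward schedule for PP=4 can be sketched as follows; this is a generic illustration of pipeline-stage scheduling, not torchtitan's scheduler:

```python
def gpipe_forward_schedule(num_stages, num_microbatches):
    """Clock-by-clock forward schedule of a fill-drain pipeline.

    Returns one list per time step; each step lists the (stage, microbatch)
    pairs running concurrently. Stage s starts microbatch m at t = s + m.
    """
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        step = [(s, t - s) for s in range(num_stages)
                if 0 <= t - s < num_microbatches]
        steps.append(step)
    return steps

sched = gpipe_forward_schedule(num_stages=4, num_microbatches=2)
# Five clock ticks; the first tick is stage 0 alone on microbatch 0.
```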
Agentic RAG + LLM-agent result evaluation harness
An agentic RAG system over SQL + document corpora: LLM orchestrator delegating to specialized sub-agents
via OpenAI-style function calling and a typed Evidence protocol; async sub-agent dispatch with
per-agent timeouts and graceful partial-result synthesis; hybrid BM25 + dense retrieval with
Reciprocal Rank Fusion over ~900 chunks from 6 SEC 10-K filings; SSE streaming of routing decisions, tool
calls, and sub-agent reasoning. Paired with a 4-mode evaluation harness
(fuzzy-numeric / entity-match / LLM-as-judge / deterministic slot-based component recall) that
specifically targets list-coverage silent failures invisible to judge-only scoring. Full-pass correctness
on a 10-question gold-labeled dev set.
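Reciprocal Rank Fusion itself is compact; a sketch (the k=60 constant follows the common convention from the original RRF paper, and the doc ids are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d)).

    rankings: list of ranked doc-id lists (best first, rank starts at 1).
    Documents missing from a list simply contribute nothing from it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25, dense])
# "d1" ranks first: 1st in BM25 and 2nd in dense beats 3rd + 1st for "d3".
```

Because RRF uses only ranks, it fuses BM25 and dense scores without any score normalization between the two retrievers.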
I currently build SIDR — AWS's Scalable Intent-Driven Routing protocol and the
routing control plane for AWS Datacenter network fabrics (a production fleet of
7,000+ switch nodes). Under SIDR, routing intent is expressed centrally and distributed /
installed across the fleet via multi-phase commit (MPC) transactions, which makes
correctness, failure-mode reasoning, and fleet-scale operational behavior the central engineering problem
of the system. My work spans SIDR distributed networking protocol feature development,
end-to-end ownership of SIDR release qualification, and
production operation of the control plane.
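A minimal two-phase sketch of the commit-or-roll-back idea (SIDR's MPC is more elaborate than this; `Switch`, `prepare`, and the intent payload here are hypothetical):

```python
def multi_phase_commit(nodes, intent):
    """Two-phase commit sketch: prepare on all nodes, then commit or roll back.

    nodes: objects with prepare(intent) -> bool, commit(), rollback().
    Returns True iff every node committed the intent.
    """
    prepared = []
    for node in nodes:
        if node.prepare(intent):
            prepared.append(node)
        else:                       # any veto aborts the whole transaction
            for p in prepared:
                p.rollback()
            return False
    for node in prepared:
        node.commit()
    return True

class Switch:
    def __init__(self, healthy=True):
        self.healthy, self.state = healthy, "idle"
    def prepare(self, intent):
        self.state = "prepared" if self.healthy else "idle"
        return self.healthy
    def commit(self):   self.state = "committed"
    def rollback(self): self.state = "idle"

fleet = [Switch(), Switch(healthy=False), Switch()]
ok = multi_phase_commit(fleet, intent={"route": "10.0.0.0/8"})
# ok is False and no switch is left half-committed.
```

The hard part in production is everything this sketch omits: nodes crashing mid-commit, messages lost between phases, and rollbacks racing recommits.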
Three headline contributions:
-
Automated release-qualification framework with chaos fault injection.
Designed and built the complete end-to-end automated release-qualification framework for SIDR, run
against test clusters: a hierarchical concurrent workflow engine orchestrating
multi-stage qualification runs, comprehensive protocol state tracing, and a
chaos fault-injection system that systematically exercises partitions, device
failures, and interface flaps to validate consensus convergence and MPC operation correctness — with
rollback — under extreme stress and chaotic network / software failure conditions. Integrated into
CI/CD; delivered a 15× speedup in release qualification.
Since October 2024, every AWS ML Datacenter switch NetOS release has been qualified and gated
by this framework — it has caught dozens of latent bugs both inside the SIDR
protocol itself and at the interaction boundary between SIDR and other AWS NetOS processes before they
could reach production, keeping every subsequent SIDR release highly reliable at production fleet scale.
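The invariant such a framework checks can be illustrated with a toy chaos loop: inject random prepare failures and confirm the all-or-nothing property holds every time (purely illustrative, not the framework itself):

```python
import random

def run_with_faults(num_nodes, fail_prob, trials=200, seed=7):
    """Chaos-style check: randomly fail prepares and assert all-or-nothing.

    Each trial simulates one commit round; a node's prepare fails with
    probability fail_prob. The invariant under test: either every node
    commits or none does.
    """
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        prepared = [rng.random() >= fail_prob for _ in range(num_nodes)]
        committed = [all(prepared)] * num_nodes   # commit only if all prepared
        if any(committed) and not all(committed):
            violations += 1
    return violations

violations = run_with_faults(num_nodes=5, fail_prob=0.2)
# A correct protocol shows zero invariant violations under injected faults.
```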
-
Core SIDR feature development.
-
SIDR route redistribution from Quagga. Cross-protocol route redistribution bridges legacy
BGP / OSPF state (Quagga zebra) into SIDR's intent-driven route programming. Owned the SIDR-side
daemon logic end-to-end: inter-process communication with Quagga, asynchronous message-stream
parsing, multi-module async orchestration keeping redistribution off the critical
path of intent commits, OS-level signaling, and MPC lazy-initialization
optimizations across multiple SIDR modules.
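The off-critical-path pattern looks roughly like this in asyncio terms (a toy model; the queue, task names, and routes are invented):

```python
import asyncio

async def demo():
    """Commit path stays fast; redistribution drains a queue in the background.

    Routes learned from a legacy daemon (stand-in for Quagga zebra) are
    enqueued; a background task installs them without blocking commits.
    """
    redist_q: asyncio.Queue = asyncio.Queue()
    installed = []

    async def redistributor():
        while True:
            route = await redist_q.get()
            if route is None:                  # shutdown sentinel
                break
            installed.append(route)            # stand-in for route programming

    task = asyncio.create_task(redistributor())
    for route in ["10.0.0.0/8", "172.16.0.0/12"]:
        redist_q.put_nowait(route)             # enqueue, never await install
    committed = "intent-1"                     # commit proceeds immediately
    await redist_q.put(None)
    await task
    return committed, installed

committed, installed = asyncio.run(demo())
```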
-
SIDR protocol credential management and security enhancement. Added
message-certificate-based authentication and verification end-to-end across
controller-to-daemon intent distribution. Designed certificate issuance / rotation into SIDR's
service-fabric workflows, explicit race-condition and failure handling during certificate
transitions, and a comprehensive end-to-end integration-test suite covering controller-to-daemon
intent signaling and protocol state-machine transitions. A cross-cutting change — every SIDR
message touched — so backward-compatible rollout mattered as much as the cryptographic component.
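A simplified sketch of the rotation-window idea using HMAC in place of certificates (SIDR's scheme is certificate-based; the keys and message here are invented):

```python
import hashlib
import hmac

def sign(message: bytes, key: bytes) -> bytes:
    """Attach an HMAC-SHA256 tag to a control-plane message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes, accepted_keys) -> bool:
    """Accept both the old and the new key during a rotation window."""
    return any(hmac.compare_digest(sign(message, k), tag) for k in accepted_keys)

old_key, new_key = b"k-2023", b"k-2024"
msg = b"intent: install route 10.0.0.0/8"
tag = sign(msg, old_key)
# Mid-rotation, messages signed with the old key must still verify:
ok = verify(msg, tag, accepted_keys=[new_key, old_key])
```

Accepting two credentials during the transition is what makes the rollout backward-compatible: daemons upgrade at different times, yet no message is ever rejected.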
-
Deep system understanding from feature development, production operation, and debugging.
Shipping SIDR features is one input; running the control plane in production and diagnosing the many
race conditions that surface when MPC transactions interleave with real link and process failures —
device crashes mid-commit, interface flaps, partial commits, rollback-vs-recommit interleavings — is
another. The combination built a depth of understanding across protocol design, state-machine
semantics, fault tolerance, and the operational realities of a very large distributed fabric
that neither feature work nor operational experience alone produces.
The foundations are CAP-theorem reasoning, multi-phase commit semantics, protocol fault tolerance,
and inter-node communication-cost optimization inside massive fabrics.
Amazon Web Services — Software Engineer
Apr 2024 – Present · Santa Clara, CA
AWS DC Network Infra — Scalable Intent-Driven Routing (SIDR) · Rust, Python, Tokio
- Designed and built an end-to-end automated release qualification framework for SIDR with hierarchical concurrent workflow engines, comprehensive protocol state tracing, and chaos fault injection (partitions, device failures, interface flaps) to validate consensus convergence and MPC operation correctness with rollback under extreme stress and chaotic network / software failure conditions. Integrated into CI/CD pipeline, achieving 15× speedup in release qualification and qualifying all subsequent SIDR production releases.
- Delivered SIDR daemon logic for cross-protocol route redistribution, including inter-process communication, message-stream parsing, multi-module asynchronous orchestration, and OS-level signaling; optimized message generation and MPC lazy initialization across multiple SIDR modules for improved efficiency.
- Delivered SIDR protocol security enhancement through message authentication and verification mechanisms, with thorough design and implementation for service-fabric workflows, race-condition / failure handling, and comprehensive end-to-end integration tests spanning controller-to-daemon intent distribution, system-level signaling, and protocol state-machine transitions under various network and system conditions.
XPeng Motors — Computer Vision Engineer Intern
Oct 2023 – Mar 2024 · San Diego, CA
Autonomous Driving Center · Python, PyTorch, DALI, ONNX
- Training Pipeline Acceleration: Integrated NVIDIA DALI for GPU-based online augmentation on large-scale image datasets, offloading preprocessing from CPU to GPU via multi-process pipelines — 7× training speedup, 80% CPU reduction.
- Multi-task Backbone Consolidation: Merged task-specific perception models into a unified shared backbone; systematically explored FLOPs / cross-task generalization trade-offs to cut on-car scheduling and memory overhead while preserving per-task accuracy.
- Eye-Action Video Classification for DMS: Owned the end-to-end pipeline (dataset, temporal model, in-vehicle validation) — 99.64% accuracy, 30% latency reduction under real-time on-car constraints.
- Simulation-Driven Long-tail Object Detection: Replenished an OD dataset with photorealistic simulation for rare categories; validated via YOLO-X showing consistent mAP gains. Co-author of Anything in Any Scene.
ByteDance — Video Algorithms Engineer Intern
Nov 2021 – Apr 2022 · Shenzhen, China
Real-Time Communication, Video Group · Python, PyTorch
- Real-time Multi-frame Super-Resolution for TikTok live-streaming RTC: novel low-level encoder/decoder auxiliary modules leveraging temporal consistency and residual-aware fusion — 43% PSNR improvement, shipped.
- Robust Facial Landmark Detection for ROI-aware bitrate allocation: facial-parsing preprocessing, weighted loss + balanced resampling for long-tail poses, global-context branch — 67% NME reduction; unstructured pruning cut inference time by a further 20%.
- Built FFmpeg-based offline augmentation simulating codec artifacts and a multi-threaded concurrent I/O queue hiding I/O behind GPU compute — 40% training time reduction.
Amazon — SDE Intern, AWS CloudFront
Jun 2023 – Sep 2023 · Seattle, WA
CloudFront Function (CF2) Tagging in Control Plane · Java, Kotlin, AWS
- In a micro-service / distributed-transaction setting, designed unique-ID-based tagging, analyzed race conditions and concurrent failures, and handled them with synchronous DB-deletion calls, cleaner threads, and DynamoDB distributed locks.
- Optimized 3 customer CF2 APIs — 25% latency reduction by eliminating redundant RPC round-trips; shortened tagging-cleaner lists for a 98.5% RPS reduction and a 60× cleaning speedup.
- Extended the AWS CloudFormation interface for CF2 tagging with an async process + callbacks + Factory pattern; comprehensive integration tests covering concurrent race conditions.
DMAI — Computer Vision Engineer Intern
Jul 2021 – Oct 2021 · Guangzhou, China
DMAI Research Center · Python, PyTorch
- AILA Preschool Learning System card recognition: benchmarked / optimized lightweight detectors (RFB, YOLO-X, YOLO-v5) with augmentation search and loss tuning — 99.5% mAP; open-set loss resolved 95% of edge-case failures at 99% precision.
My research spans semantic segmentation, knowledge-distillation-based continual learning,
classification, out-of-distribution detection, and transfer learning. Core methodological
theme: what a visual encoder produces is only half the information — how it attends,
relates, and abstracts is equally transferable and equally worth distilling.
Selected top-cited contributions
-
SATS: Self-Attention Transfer for Continual Semantic Segmentation. 1st author.
Pattern Recognition, 2023 · 53 citations.
Continual semantic segmentation suffers from catastrophic forgetting when a model is incrementally
trained on new classes while retaining old ones. SATS introduces a lightweight self-attention
transfer scheme that distills the inter-patch relationship structure from
the old model's self-attention maps to the new model during incremental steps — transferring how
the visual encoder attends between patches, not just what features it produces. The technique is a
plug-in for any vision-transformer-based backbone, sets state of the art on standard
continual-segmentation benchmarks, and is the method I'm now extending from CNN-era continual
segmentation into modern VLM distillation.
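The core of the transfer term can be sketched in a few lines (a simplified scalar version; the paper operates on transformer self-attention maps per layer and head, and uses its own distance formulation):

```python
import math

def softmax(row):
    """Numerically stable softmax over one attention row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_transfer_loss(student_logits, teacher_logits):
    """Mean squared error between softmaxed attention maps.

    Each argument: per-query rows of raw attention logits over patches.
    The loss distills *where* the old model attends between patches,
    not just what features it produces.
    """
    loss, count = 0.0, 0
    for s_row, t_row in zip(student_logits, teacher_logits):
        for s, t in zip(softmax(s_row), softmax(t_row)):
            loss += (s - t) ** 2
            count += 1
    return loss / count

identical = [[1.0, 2.0], [0.5, 0.5]]
# A student whose attention matches its teacher incurs zero transfer loss.
zero = attention_transfer_loss(identical, identical)
```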
-
Classifier-head Informed Feature Masking and Prototype-based Logit Smoothing for Out-of-Distribution
Detection. 2nd author. IEEE TCSVT, 2024 · 23 citations.
Post-hoc OOD detection combining classifier-head-informed feature masking
(classifier-head weights mask activations not tied to any in-distribution class) with
prototype-based logit smoothing (class prototypes regularize logits for off-manifold
samples).
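A simplified pure-Python sketch of the feature-masking half (the prototype-based logit smoothing is omitted; the weights, keep fraction, and importance rule here are illustrative, not the paper's exact formulation):

```python
def masked_logits(features, W, keep_frac=0.5):
    """Keep only the feature dims most important to some in-distribution class.

    W: per-class weight rows; a dim's importance = max |weight| over classes.
    Dims with low importance (tied to no ID class) are zeroed before
    computing logits, which tends to shrink activations of OOD inputs.
    """
    dims = len(features)
    importance = [max(abs(W[c][d]) for c in range(len(W))) for d in range(dims)]
    n_keep = max(1, int(dims * keep_frac))
    keep = set(sorted(range(dims), key=lambda d: importance[d], reverse=True)[:n_keep])
    masked = [features[d] if d in keep else 0.0 for d in range(dims)]
    return [sum(W[c][d] * masked[d] for d in range(dims)) for c in range(len(W))]

W = [[2.0, 0.1, 0.0], [0.0, 0.1, 1.5]]   # two classes, three feature dims
logits = masked_logits([1.0, 1.0, 1.0], W, keep_frac=0.67)
# Dim 1 (importance 0.1) is masked out; dims 0 and 2 survive.
```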
-
Topic Driven Adaptive Network for Cross-Domain Sentiment Classification (TDAN). 2nd author.
Information Processing and Management, 2023 · 23 citations.
Cross-domain sentiment classification via a neural topic model plus a
keyword branch, fused by cross dot-product attention between topic
and keyword representations — grounding predictions in keyword-occurrence signals that transfer across
source and target domains.
Other publications
Anything in Any Scene: Photorealistic Video Object Insertion
Co-author · Preprint, arXiv:2401.17509, 2024 · 300+ GitHub stars · 12 citations · Presented at EI 2025 Highlights Session (IS&T Electronic Imaging)
Deep Model Reference: Simple Yet Effective Confidence Estimation for Image Classification
2nd author · MICCAI, 2024
Class Incremental Learning with Task-Specific Batch Normalization and Out-of-Distribution Detection
3rd author · Neurocomputing, 2026 · 1 citation · ScienceDirect ↗
Local Background Features Matter in Out-of-Distribution Detection
3rd author · Under review at Neural Computation
Full list on Google Scholar →