Yiqiao Qiu 邱奕乔

Computer Vision / ML Engineer · Distributed Systems SDE

Currently Software Engineer at AWS (Datacenter Network Infra — Scalable Intent-Driven Routing). Previously Computer Vision Engineer Intern at XPeng Motors, ByteDance, and DMAI, and SDE Intern at Amazon CloudFront. UCSD MS CSE, GPA 3.93. 7 publications · 112 citations.

01 About

My work threads together two layers of the modern AI stack. On the model layer: computer-vision and multi-modal-LLM algorithm work across the full industrial lifecycle — model optimization and deployment, research publications, and ML-infrastructure implementation. Underneath: the massive-scale distributed networking and routing control plane running AI datacenter clusters — the infrastructure powering every training and inference workload in the modern AI ecosystem.

03 Skills

Programming Languages

Python · Rust · C / C++ · Java · Kotlin · Shell · SQL

Machine Learning / Deep Learning

Efficient Industrial Model Optimization · Continual Learning · Model Distillation · LoRA fine-tuning · Transfer Learning · Supervised / Semi-Supervised

Computer Vision

Semantic Segmentation · Classification · Object Detection · Super-Resolution · Facial Landmark Detection · Scene Understanding · VQA · Anomaly Detection

Distributed Systems

Large-scale distributed systems · CAP theorem · multi-phase commit · protocol fault tolerance · inter-node communication-cost optimization

Networking

SDN · BGP · OSPF · Quagga zebra · IPv4 / IPv6 · TCP / UDP · Rust Netlink

OS / System

Linux Kernel · Rust Tokio async · gRPC · Docker

ML Infrastructure & Deployment

PyTorch · torchtitan · Pipeline Parallelism · Fully Sharded Data Parallelism (FSDP) · ONNX · NVIDIA DALI · llama.cpp · GGUF quantization

Cloud Services

AWS (DynamoDB · S3 · CloudWatch · CloudFront · CloudFormation)

04 Education

University of California, San Diego

Sep 2022 – Mar 2024

M.S. in Computer Science and Engineering · GPA 3.93 / 4.0

Sun Yat-sen University

Sep 2018 – Jun 2022

B.Eng. in Computer Science · Major GPA 3.94 / 4.0 (top 10%) · Overall GPA 3.8 / 4.0

05 Get in touch

Open to CV / ML Engineer and Distributed Systems SDE roles. Best reached by email — yiqiaoqiu@hotmail.com.

Story I

Computer Vision & Multi-Modal

Algorithm · Industrial Optimization · Deployment · Infrastructure

My ML work tells one story: take the efficient-model mindset — latency, memory, throughput — from classical CV into modern VLMs and the training systems beneath them.

Industrial efficient-model work

Across three internships I shipped production CV models covering the core task families — semantic segmentation, object detection, super-resolution, facial-landmark detection, image classification, and video classification — under strict latency / compute budgets.

VLM — fine-tuning, token compression, on-device deployment

Distilled Qwen2.5-VL 32B → 3B with rank-16 LoRA on DriveLM-nuScenes, and integrated four visual-token compression methods (FasterVLM, PruMerge, PyramidDrop, and my own SATS-CRP — region-aware self-attention transfer). Results: 4× token reduction (480 → 120) with a +1% DriveLM LoRA accuracy gain; 16× extreme compression at only 2.4% degradation; and BF16 → GGUF Q4_K_M quantization (3.9× smaller) deployed on a consumer RTX 4070 Ti via llama.cpp at 170 tokens/s with 142 ms TTFT.
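The attention-ranked pruning idea that FasterVLM-style methods share can be sketched in plain Python: score each visual token by how much attention a summary token pays it, keep the top fraction, and preserve the original spatial order. This is an illustrative sketch under those assumptions (the function name and list-based representation are mine, not any paper's code):

```python
def prune_visual_tokens(tokens, attn_scores, keep_ratio=0.25):
    """Keep the top-k visual tokens ranked by attention score.

    tokens       -- sequence of visual-token embeddings (any objects)
    attn_scores  -- one relevance score per token (e.g. [CLS] attention)
    keep_ratio   -- fraction of tokens to retain (0.25 ~ 4x compression)
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank indices by score, take the top k, then restore the original
    # spatial order so positional structure survives the pruning step.
    top = sorted(sorted(range(len(tokens)), key=lambda i: -attn_scores[i])[:k])
    return [tokens[i] for i in top]
```

With 480 visual tokens and keep_ratio=0.25 this yields the 120-token budget mentioned above; the real methods differ mainly in where the scores come from and whether dropped tokens are merged rather than discarded.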

ML Infra — Pipeline Parallelism on torchtitan

My fork of PyTorch torchtitan (branch 309b462) implements a Block Attention Residuals experiment composed with multi-stage pipeline parallelism (PP=4, cache-adapter smoke tests green); an AttnRes subclass with a scaling-law config registry; silent-grad-loss detection with clone-on-capture; a grouped_mm + torch.compile throughput path; and LLaVA-style multimodal scaffolding on top of a Kimi Linear (KDA / MLA / MLP) base.
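What PP=4 buys is easiest to see from the schedule itself: micro-batches enter the pipeline staggered, so after a short warm-up bubble all stages run concurrently. A toy GPipe-style forward-schedule generator (illustrative only; this is not torchtitan's scheduler, and the function name is mine):

```python
def gpipe_forward_schedule(num_stages, num_microbatches):
    """Return, per clock tick, the (stage, microbatch) pairs that run.

    Micro-batch m enters stage s at tick s + m, so stage s idles for
    its first s ticks (the "pipeline bubble") and is busy afterwards.
    """
    ticks = num_stages + num_microbatches - 1
    schedule = []
    for t in range(ticks):
        step = [(s, t - s) for s in range(num_stages)
                if 0 <= t - s < num_microbatches]
        schedule.append(step)
    return schedule
```

For PP=4 with 8 micro-batches the pipeline is fully occupied from tick 3 onward, which is why more micro-batches per step amortize the bubble; real schedulers (1F1B and friends) additionally interleave backward passes to bound activation memory.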

Agentic RAG + LLM-agent result evaluation harness

An agentic RAG system over SQL + document corpora: LLM orchestrator delegating to specialized sub-agents via OpenAI-style function calling and a typed Evidence protocol; async sub-agent dispatch with per-agent timeouts and graceful partial-result synthesis; hybrid BM25 + dense retrieval with Reciprocal Rank Fusion over ~900 chunks from 6 SEC 10-K filings; SSE streaming of routing decisions, tool calls, and sub-agent reasoning. Paired with a 4-mode evaluation harness (fuzzy-numeric / entity-match / LLM-as-judge / deterministic slot-based component recall) that specifically targets list-coverage silent failures invisible to judge-only scoring. Full-pass correctness on a 10-question gold-labeled dev set.
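Reciprocal Rank Fusion itself is small enough to show whole: each retriever's ranked list contributes 1 / (k + rank) per document, so documents that both BM25 and the dense retriever rank highly rise to the top. A minimal sketch (function name mine; k = 60 is the constant from the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids via RRF.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it (rank is 1-based); the k constant damps the influence
    of any single retriever's top position.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: -scores[d])
```

A document ranked 2nd and 1st beats one ranked 1st and 3rd, which is the behavior that makes RRF a robust default for hybrid lexical + dense retrieval without any score normalization.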

Story II

Massive-scale Distributed Systems

Datacenter Cluster Networking Infrastructure · SIDR

I currently build SIDR — AWS's Scalable Intent-Driven Routing protocol and the routing control plane for AWS Datacenter network fabrics (a production fleet of 7,000+ switch nodes). Under SIDR, routing intent is expressed centrally and distributed / installed across the fleet via multi-phase commit (MPC) transactions, which makes correctness, failure-mode reasoning, and fleet-scale operational behavior the central engineering problem of the system. My work spans feature development on the SIDR distributed networking protocol, end-to-end ownership of SIDR release qualification, and production operation of the control plane.

Three headline contributions:

  1. Automated release-qualification framework with chaos fault injection. Designed and built the complete end-to-end automated release-qualification framework for SIDR, run against test clusters: a hierarchical concurrent workflow engine orchestrating multi-stage qualification runs, comprehensive protocol state tracing, and a chaos fault-injection system that systematically exercises partitions, device failures, and interface flaps to validate consensus convergence and MPC operation correctness — with rollback — under extreme stress and chaotic network / software failure conditions. Integrated into CI/CD; delivered a 15× speedup in release qualification. Since October 2024, every AWS ML Datacenter switch NetOS release has been qualified and gated by this framework — it has caught dozens of latent bugs both inside the SIDR protocol itself and at the interaction boundary between SIDR and other AWS NetOS processes before they could reach production, keeping every subsequent SIDR release highly reliable at production fleet scale.
  2. Core SIDR feature development.
    • SIDR route redistribution from Quagga. Cross-protocol route redistribution bridges legacy BGP / OSPF state (Quagga zebra) into SIDR's intent-driven route programming. Owned the SIDR-side daemon logic end-to-end: inter-process communication with Quagga, asynchronous message-stream parsing, multi-module async orchestration keeping redistribution off the critical path of intent commits, OS-level signaling, and MPC lazy-initialization optimizations across multiple SIDR modules.
    • SIDR protocol credential management and security enhancement. Added message-certificate-based authentication and verification end-to-end across controller-to-daemon intent distribution. Designed certificate issuance / rotation into SIDR's service-fabric workflows, explicit race-condition and failure handling during certificate transitions, and a comprehensive end-to-end integration-test suite covering controller-to-daemon intent signaling and protocol state-machine transitions. A cross-cutting change — every SIDR message touched — so backward-compatible rollout mattered as much as the cryptographic component.
  3. Deep system understanding from feature development, production operation, and debugging. Shipping SIDR features is one input; running the control plane in production and diagnosing the many race conditions that surface when MPC transactions interleave with real link and process failures — device crashes mid-commit, interface flaps, partial commits, rollback-vs-recommit interleavings — is another. The combination built a depth of understanding across protocol design, state-machine semantics, fault tolerance, and the operational realities of a very large distributed fabric that neither feature work nor operational experience alone produces.

The foundations are CAP-theorem reasoning, multi-phase commit semantics, protocol fault tolerance, and inter-node communication-cost optimization inside massive fabrics.
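In miniature, the MPC pattern reduces to classic two-phase commit: stage the change on every node, commit only on a unanimous yes-vote, and roll back otherwise. A deliberately simplified in-memory sketch (nothing here is SIDR code; all names are illustrative):

```python
class Participant:
    """A node that can tentatively stage a change, then commit or undo it."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.staged, self.state = None, {}

    def prepare(self, key, value):
        if not self.healthy:
            return False          # vote "abort"
        self.staged = (key, value)
        return True               # vote "commit"

    def commit(self):
        key, value = self.staged  # durably apply the staged change
        self.state[key] = value
        self.staged = None

    def rollback(self):
        self.staged = None        # discard the tentative change

def two_phase_commit(participants, key, value):
    """Phase 1: collect prepare votes. Phase 2: commit everywhere,
    or roll back everywhere if any participant voted abort."""
    if all(p.prepare(key, value) for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False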

Chronology

Work Experience

Positions held, most-recent first

Amazon Web Services — Software Engineer

Apr 2024 – Present  ·  Santa Clara, CA

AWS DC Network Infra — Scalable Intent-Driven Routing (SIDR) · Rust, Python, Tokio

  • Designed and built an end-to-end automated release qualification framework for SIDR with hierarchical concurrent workflow engines, comprehensive protocol state tracing, and chaos fault injection (partitions, device failures, interface flaps) to validate consensus convergence and MPC operation correctness with rollback under extreme stress and chaotic network / software failure conditions. Integrated into CI/CD pipeline, achieving 15× speedup in release qualification and qualifying all subsequent SIDR production releases.
  • Delivered SIDR daemon logic for cross-protocol route redistribution, including inter-process communication, message-stream parsing, multi-module asynchronous orchestration, and OS-level signaling; optimized message generation and MPC lazy-init across multiple SIDR modules for improved efficiency.
  • Delivered SIDR protocol security enhancements through message authentication and verification mechanisms, with thorough design and implementation for service-fabric workflows, race-condition / failure handling, and comprehensive end-to-end integration tests spanning controller-to-daemon intent distribution, system-level signaling, and protocol state-machine transitions under various network and system conditions.

XPeng Motors — Computer Vision Engineer Intern

Oct 2023 – Mar 2024  ·  San Diego, CA

Autonomous Driving Center · Python, PyTorch, DALI, ONNX

  • Training Pipeline Acceleration: Integrated NVIDIA DALI for GPU-based online augmentation on huge-scale image datasets, offloading preprocessing from CPU to GPU via multi-process pipelines — 7× training speedup, 80% CPU reduction.
  • Multi-task Backbone Consolidation: Merged task-specific perception models into a unified shared backbone; systematically explored FLOPs / cross-task generalization trade-offs to cut on-car scheduling and memory overhead while preserving per-task accuracy.
  • Eye-Action Video Classification for DMS: Owned the end-to-end pipeline (dataset, temporal model, in-vehicle validation) — 99.64% accuracy, 30% latency reduction under real-time on-car constraints.
  • Simulation-Driven Long-tail Object Detection: Replenished an object-detection dataset with photorealistic simulation for rare categories; validated via YOLO-X, showing consistent mAP gains. Co-author of Anything in Any Scene.

ByteDance — Video Algorithms Engineer Intern

Nov 2021 – Apr 2022  ·  Shenzhen, China

Real-Time Communication, Video Group · Python, PyTorch

  • Real-time Multi-frame Super-Resolution for TikTok live-streaming RTC: novel low-level encoder/decoder auxiliary modules leveraging temporal consistency and residual-aware fusion — 43% PSNR improvement, shipped.
  • Robust Facial Landmark Detection for ROI-aware bitrate allocation: facial-parsing preprocessing, weighted loss + balanced resampling for long-tail poses, global-context branch — 67% NME reduction; unstructured pruning cut inference time by a further 20%.
  • Built FFmpeg-based offline augmentation simulating codec artifacts and a multi-threaded concurrent I/O queue hiding I/O behind GPU compute — 40% training time reduction.
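The I/O-hiding pattern in the last bullet is a bounded producer/consumer queue: a background thread loads ahead while the consumer computes, so disk and decode latency overlap with GPU work. A stdlib-only sketch of the idea (single producer thread for clarity; the production version was multi-threaded, and the names are mine):

```python
import queue
import threading

def prefetch(load_fn, items, depth=4):
    """Yield load_fn(item) for each item, loading ahead on a
    background thread so I/O overlaps with downstream compute.

    depth bounds the queue, capping memory used by loaded-but-
    unconsumed batches.
    """
    q = queue.Queue(maxsize=depth)
    done = object()   # sentinel marking the end of the stream

    def producer():
        for item in items:
            q.put(load_fn(item))   # blocks when depth batches are queued
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not done:
        yield batch
```

Usage mirrors a plain map over the dataset — `for batch in prefetch(load_and_augment, paths): train_step(batch)` — but the next batch is already being read while the current one trains.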

Amazon — SDE Intern, AWS CloudFront

Jun 2023 – Sep 2023  ·  Seattle, WA

CloudFront Functions (CF2) Tagging in the Control Plane · Java, Kotlin, AWS

  • In a micro-service / distributed-transaction setting, designed unique-ID based tagging, analyzed race conditions and concurrent failures, and handled them with synchronous DB-deletion calls, cleaner threads, and DynamoDB distributed locks.
  • Optimized 3 customer-facing CF2 APIs — 25% latency reduction by eliminating redundant RPC round-trips; shortened tagging-cleaner lists for a 98.5% RPS reduction and 60× faster cleaning.
  • Extended the AWS CloudFormation interface for CF2 tagging with an async process, callbacks, and the Factory pattern; comprehensive integration tests covering concurrent race conditions.
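The DynamoDB distributed-lock pattern rests on one primitive: a conditional write that succeeds only if the lock item does not already exist, so exactly one contender wins. An in-memory model of that semantics (illustrative names throughout; this is not the boto3 or DynamoDB API):

```python
import threading

class ConditionalStore:
    """In-memory stand-in for a table supporting conditional puts,
    the primitive that DynamoDB-style locks are built on."""
    def __init__(self):
        self._items, self._mu = {}, threading.Lock()

    def put_if_absent(self, key, value):
        with self._mu:                 # atomic check-and-set
            if key in self._items:
                return False           # condition failed: lock held
            self._items[key] = value
            return True

    def delete(self, key):
        with self._mu:
            self._items.pop(key, None)

def with_lock(store, name, owner, critical_section):
    """Run critical_section only if we win the lock; always release."""
    if not store.put_if_absent(name, owner):
        return False                   # another worker holds the lock
    try:
        critical_section()
    finally:
        store.delete(name)
    return True
```

The real pattern adds a lease expiry on the lock item so a crashed holder cannot block the system forever — the same liveness concern the cleaner threads above address.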

DMAI — Computer Vision Engineer Intern

Jul 2021 – Oct 2021  ·  Guangzhou, China

DMAI Research Center · Python, PyTorch

  • AILA Preschool Learning System card recognition: benchmarked / optimized lightweight detectors (RFB, YOLO-X, YOLO-v5) with augmentation search and loss tuning — 99.5% mAP; open-set loss resolved 95% of edge-case failures at 99% precision.

Publications

Research

7 publications · 112 citations

My research spans semantic segmentation, knowledge-distillation-based continual learning, classification, out-of-distribution detection, and transfer learning. Core methodological theme: what a visual encoder produces is only half the information — how it attends, relates, and abstracts is equally transferable and equally worth distilling.

Other publications

Anything in Any Scene: Photorealistic Video Object Insertion

Co-author · Preprint, arXiv:2401.17509, 2024 · 300+ GitHub stars · 12 citations · Presented at EI 2025 Highlights Session (IS&T Electronic Imaging)

Deep Model Reference: Simple Yet Effective Confidence Estimation for Image Classification

2nd author · MICCAI, 2024

Class Incremental Learning with Task-Specific Batch Normalization and Out-of-Distribution Detection

3rd author · Neurocomputing, 2026 · 1 citation · ScienceDirect ↗

Local Background Features Matter in Out-of-Distribution Detection

3rd author · Under review at Neural Computation

Full list on Google Scholar →