Yiqiao Qiu 邱奕乔

Computer Vision / ML Engineer • Distributed Systems SDE

Currently Software Engineer at AWS (Datacenter Network Infra — Scalable Intent-Driven Routing). Previously Computer Vision Engineer Intern at XPeng Motors, ByteDance, and DMAI, and SDE Intern at Amazon CloudFront. UCSD MS CSE, GPA 3.93. 7 publications · 115 citations.

Resume — CV / MLE ↓ Resume — Distributed Systems SDE ↓

01 About

My work threads together three layers of the modern AI stack. The model layer — computer-vision and multi-modal-LLM algorithm work across the full industrial lifecycle (XPeng / ByteDance / DMAI deployment, VLM distillation, 7 research papers). The ML-systems layer beneath it — Block Attention Residuals on torchtitan, multi-stage Pipeline Parallelism, an SGLang inference fork with Triton-fused kernels, and a production-scale GRPO RLHF loop. The datacenter layer below that — the massive-scale distributed networking and routing control plane (SIDR) running AWS AI training and inference clusters.

02 Overview

Story I

Computer Vision & Multi-Modal

Industrial CV at XPeng, ByteDance, and DMAI — segmentation, detection, super-resolution, landmark detection, video classification. 7 research papers on continual learning, distillation, and OOD. Current work: an end-to-end efficient VLM→VLA pipeline for driving (Qwen2.5-VL / Qwen3-VL — token compression, TensorRT-LLM deployment, GRPO RL) and Agentic RAG.

Read the full story →

Story II

Distributed Systems

SIDR — AWS's Scalable Intent-Driven Routing protocol for datacenter fabrics (7,000+ switch nodes). Feature development, release-qualification ownership, and production operation of the control plane.

Read the full story →

Story III

ML Infra

Block Attention Residuals on pytorch/torchtitan (RFC #3029, 12 phases) — multi-stage Pipeline Parallelism with a novel cross-stage cache adapter, Kimi Linear (KDA + MLA + MoE) port, multimodal LLaVA scaffolding, an SGLang inference fork with a fused Triton kernel, a 1,000-step GRPO RLHF run capturing 63K NCCL ops, and a Temporal AttnRes extension for long-video.

Read the full story →

Chronology

Experience

Currently Software Engineer at AWS (SIDR); previously Computer Vision Engineer Intern at XPeng Motors, ByteDance, and DMAI, and SDE Intern at Amazon CloudFront.

See timeline →

Publications

Research

7 publications · 115 citations across semantic segmentation, KD-based continual learning, classification, OOD detection, and transfer learning. 1st-author SATS paper in Pattern Recognition.

See publications →

03 Skills

Programming Languages

Python · Rust · C / C++ · Java · Kotlin · shell · SQL

Machine Learning / Deep Learning

Efficient Industrial Model Optimization · Model Distillation · Visual Token Compression · Vision-Language-Action (VLA) · GRPO RL post-training · Continual Learning · LoRA fine-tuning · Transfer Learning · Supervised / Semi-Supervised Learning

Computer Vision

Semantic Segmentation · Classification · Object Detection · Super-Resolution · Facial Landmark Detection · Scene Understanding · VQA · Anomaly Detection

Distributed Systems

Large-scale distributed systems · CAP theorem · multi-phase commit · protocol fault tolerance · inter-node communication-cost optimization

Networking

SDN · BGP · OSPF · Quagga zebra · IPv4 / IPv6 · TCP / UDP · Rust Netlink

OS / System

Linux Kernel · Rust Tokio async · gRPC · Docker

ML Infrastructures & Deployment

PyTorch · torchtitan · Pipeline Parallelism (Interleaved1F1B + cross-stage cache adapter) · FSDP / FSDP2 · SGLang inference · TensorRT-LLM (BF16 / FP8 / NVFP4) · CUDA Graph · Triton kernels · GRPO / PPO RLHF · Monarch · NCCL tracing · ONNX · NVIDIA DALI · llama.cpp · GGUF quantization

Cloud Services

AWS (DynamoDB · S3 · CloudWatch · CloudFront · CloudFormation)

04 Education

University of California, San Diego

Sep 2022 – Mar 2024

M.S. in Computer Science and Engineering · GPA 3.93 / 4.0

Sun Yat-sen University

Sep 2018 – Jun 2022

B.Eng. in Computer Science · Major GPA 3.94 / 4.0 (top 10%) · Overall GPA 3.8 / 4.0

05 Get in touch

Open to CV / ML Engineer, ML Infrastructure SDE (especially Distributed Training Framework), and Distributed Systems and Datacenter Networking SDE roles. Best reached by email — yiqiaoqiu@hotmail.com.

✉ yiqiaoqiu@hotmail.com in LinkedIn GS Google Scholar </> GitHub

Story I

Computer Vision & Multi-Modal

Algorithm · Industrial Optimization · Deployment · Infrastructure

My CV / multi-modal work tells one story: take the efficient-model mindset — latency, memory, throughput — from classical CV into modern VLMs and LLM-agent applications. The training-systems side of the same arc — Block Attention Residuals on torchtitan, multi-stage Pipeline Parallelism, the SGLang inference fork, and the GRPO RLHF loop — lives on its own page (Story III · ML Infra →).

Industrial efficient-model work

Across three internships I shipped production CV models covering the core task families — semantic segmentation, object detection, super-resolution, facial-landmark detection, image classification, and video classification — under strict latency / compute budgets.

XPeng Motors — Autonomous Driving Center. Python, PyTorch, NVIDIA DALI, ONNX · Oct 2023 – Mar 2024.
- Training-pipeline acceleration for large-scale perception models. Accelerated on-car perception training by integrating NVIDIA DALI for GPU-based online augmentation on huge-scale image datasets, offloading preprocessing from CPU to GPU via multi-process pipelines — 7× training speedup and 80% CPU reduction, unblocking faster iteration across perception teams.
- Multi-task backbone consolidation for on-car deployment. Merged multiple task-specific perception models into a unified shared backbone; systematically explored trade-offs across architectures, FLOPs, and cross-task generalization. Reduced on-car model scheduling and memory overhead while preserving per-task accuracy, improving deployment efficiency on resource-constrained automotive compute.
- Eye-action video classification for the Driver Monitoring System (DMS). Owned end-to-end development of the eye-action recognition pipeline for in-cabin fatigue and distraction detection — dataset construction, temporal model design and training, and in-vehicle real-scene validation under varied lighting and head poses. 99.64% binary classification accuracy with 30% inference-latency reduction, meeting the real-time on-car constraint.
- Simulation-driven data augmentation for long-tail object detection. Replenished an object-detection dataset with photorealistic simulation data for rare / long-tail categories; validated the pipeline by training YOLO-X on the augmented dataset and demonstrating consistent mAP gains on underrepresented classes. Co-author of Anything in Any Scene.
ByteDance — Real-Time Communication, Video Group. Python, PyTorch · Nov 2021 – Apr 2022.
- Real-time multi-frame super-resolution for live-streaming codec enhancement. Proposed novel auxiliary modules at the low-level encoder / decoder of a real-time multi-frame super-resolution model, leveraging temporal consistency across adjacent frames and residual-aware feature fusion to alleviate video-decoding blocky artifacts from aggressive compression — without inflating inference latency. 43% PSNR gain improvement in offline testing; deployed into the TikTok live-streaming RTC video engine.
- Robust facial-landmark detection for ROI-aware bitrate allocation. Optimized the landmark model driving ROI-based bitrate allocation on streamers' faces via (1) facial-parsing preprocessing for semantic priors, (2) weighted loss with balanced resampling for long-tail poses / occlusions, and (3) an auxiliary global-context branch stabilizing predictions under occlusion and motion blur. 67% NME reduction; unstructured pruning further cut inference time by 20% to fit the RTC latency budget.
- Training infrastructure. Built FFmpeg-based offline augmentation simulating codec artifacts, bitrate ladders, and frame-drop patterns, plus a multi-threaded concurrent I/O queue hiding I/O latency behind GPU compute — 40% training-time reduction on large-scale video datasets.
DMAI — Research Center. Python, PyTorch · Jul 2021 – Oct 2021. Benchmarked and optimized lightweight object detectors (RFB, YOLO-X, YOLO-v5) with data-augmentation search and loss tuning to suppress false positives (99.5% mAP); improved classification with open-set loss to resolve 95% of edge-case failures at 99% precision.

Efficient VLM → VLA for autonomous driving — full lifecycle

An end-to-end efficient vision-language project for driving QA and vision-language-action (VLA) trajectory planning, taking Qwen2.5-VL-3B and Qwen3-VL-4B across the full industrial arc: LoRA fine-tuning → visual-token compression → multimodal fusion → quantized / TensorRT deployment → GRPO RL post-training. Built on DriveLM-nuScenes (QIU023/DriveLM_VLM_Project, 30+ engineering phases).

LoRA fine-tuning + visual-token compression. LoRA (rank-16, α=32) on DriveLM-nuScenes (56.7% baseline on the full N=18,898 val set); integrated four token-compression methods between vision encoder and LLM — FasterVLM, PruMerge, PyramidDrop, and my own SATS-CRP (region-aware self-attention transfer). 4× reduction (480→120) loses zero accuracy (all three ≥ baseline; PruMerge +0.7%), and 16× extreme compression (480→30) costs only 2.4% — quantifying how redundant driving-scene visual tokens are.
On-device quantized deployment. BF16 → GGUF Q4_K_M (3.9× smaller, 1.8 GB) served on a consumer RTX 4070 Ti via llama.cpp at 170 tokens/s and 142 ms TTFT.
Multimodal full-modal VLA. Extended Qwen3-VL-4B to trajectory planning with OpenVLA-bin action tokens, fusing multi-camera video + HD-map BEV render + 3D-bbox & ego-state text serialization; verified clean 1-cam planning at L2 0.705. A pre-deploy eval-credibility audit caught a silent thumbnail-video downscaling bug contaminating the 3-cam line — and now gates every run behind a recite-the-audit checklist.
TensorRT-LLM deployment. Deployed the Qwen3-VL-4B VLA on TRT-LLM 1.3.0rc15 across three precisions (BF16 / FP8 / NVFP4) × FasterVLM 4×, all passing an HF top-5 parity gate; full multimodal TRT pipeline hits a 3.45× TTFT speedup vs HF, with real TensorRT ViT engines at 1.45× (bf16) / 1.96× (fp16). Profiling showed weight-only quant doesn't help single-stream short-prompt latency (BF16 fastest at 85.7 ms TTFT) but shrinks the checkpoint 14% — a concrete precision-vs-regime tradeoff result.
GRPO RL post-training. Group-relative policy optimization for the planning head on 8× RTX 5090 (FSDP), with a reward composing L2-to-ground-truth, UniAD ego-box collision rate, and progress shaping engineered to resist the "predict-zeros" reward-hacking attack.

Agentic RAG + LLM-agent result evaluation harness

An agentic RAG system over SQL + document corpora: LLM orchestrator delegating to specialized sub-agents via OpenAI-style function calling and a typed Evidence protocol; async sub-agent dispatch with per-agent timeouts and graceful partial-result synthesis; hybrid BM25 + dense retrieval with Reciprocal Rank Fusion over ~900 chunks from 6 SEC 10-K filings; SSE streaming of routing decisions, tool calls, and sub-agent reasoning. Paired with a 4-mode evaluation harness (fuzzy-numeric / entity-match / LLM-as-judge / deterministic slot-based component recall) that specifically targets list-coverage silent failures invisible to judge-only scoring. Full-pass correctness on a 10-question gold-labeled dev set.

Story II

Massive-scale Distributed Systems

Datacenter Cluster Networking Infrastructure · SIDR

I currently build SIDR — AWS's Scalable Intent-Driven Routing protocol (re:Invent 2023 · NET401-R) and the routing control plane for AWS Datacenter network fabrics (a production fleet of 7,000+ switch nodes). Under SIDR, routing intent is expressed centrally and distributed / installed across the fleet via multi-phase commit (MPC) transactions, which makes correctness, failure-mode reasoning, and fleet-scale operational behavior the central engineering problem of the system. My work spans SIDR distributed networking protocol feature development, end-to-end ownership of SIDR release qualification, and production operation of the control plane.

Three headline contributions:

Automated release-qualification framework with chaos fault injection. Designed and built the complete end-to-end automated release-qualification framework for SIDR, run against test clusters: a hierarchical concurrent workflow engine orchestrating multi-stage qualification runs, comprehensive protocol state tracing, and a chaos fault-injection system that systematically exercises partitions, device failures, and interface flaps to validate consensus convergence and MPC operation correctness — with rollback — under extreme stress and chaotic network / software failure conditions. Integrated into CI/CD; delivered a 15× speedup in release qualification. Since October 2024, every AWS ML Datacenter switch NetOS release has been qualified and gated by this framework — it has caught dozens of latent bugs both inside the SIDR protocol itself and at the interaction boundary between SIDR and other AWS NetOS processes before they could reach production, keeping every subsequent SIDR release highly reliable at production fleet scale.
Core SIDR feature development.
- SIDR route redistribution from Quagga. Cross-protocol route redistribution bridges legacy BGP / OSPF state (Quagga zebra) into SIDR's intent-driven route programming. Owned the SIDR-side daemon logic end-to-end: inter-process communication with Quagga, asynchronous message-stream parsing, multi-module async orchestration keeping redistribution off the critical path of intent commits, OS-level signaling, and MPC lazy-initialization optimizations across multiple SIDR modules.
- SIDR protocol credential management and security enhancement. Added message-certificate-based authentication and verification end-to-end across controller-to-daemon intent distribution. Designed certificate issuance / rotation into SIDR's service-fabric workflows, explicit race-condition and failure handling during certificate transitions, and a comprehensive end-to-end integration-test suite covering controller-to-daemon intent signaling and protocol state-machine transitions. A cross-cutting change — every SIDR message touched — so backward-compatible rollout mattered as much as the cryptographic component.
Deep system understanding from feature development, production operation, and debugging. Shipping SIDR features is one input; running the control plane in production and diagnosing the many race conditions that surface when MPC transactions interleave with real link and process failures — device crashes mid-commit, interface flaps, partial commits, rollback-vs-recommit interleavings — is another. The combination built a depth of understanding across protocol design, state-machine semantics, fault tolerance, and the operational realities of a very large distributed fabric that neither feature work nor operational experience alone produces.

The foundations are CAP-theorem reasoning, multi-phase commit semantics, protocol fault tolerance, and inter-node communication-cost optimization inside massive fabrics.

Story III

ML Infra

Training · Pipeline Parallelism · Inference · RLHF

My ML-systems work covers the full LLM lifecycle on a single project: algorithm-architecture co-design (Block Attention Residuals) → large-scale pretraining (FSDP + multi-stage Pipeline Parallelism) → multimodal continued-pretrain & SFT → inference acceleration (SGLang fork + Triton-fused kernels + CUDA Graph) → RLHF (production-scale GRPO/PPO via SGLang generation). All of it lives in QIU023/torchtitan_attention_residual (220+ commits across 12 phases) and the companion inference fork QIU023/sglang@attention_residual_inference. RFC filed upstream as pytorch/torchtitan#3029.

Block Attention Residuals — algorithm-architecture co-design

Independent end-to-end reproduction of Block Attention Residuals (Kimi Team, arXiv:2603.15031) as a contribution to pytorch/torchtitan. The algorithm replaces fixed residual accumulation with softmax attention over block outputs — partitioning layers into N blocks (~8 recommended), running standard residuals inside blocks and firing attention-based aggregation only at boundaries. Cross-stage pipeline-parallel traffic drops from O(L·d) to O(N·d), making the algorithm directly compatible with multi-stage PP — this is the load-bearing engineering insight that motivated the entire project.

Phase 2 — Single-GPU FSDP A/B (174M dense Llama3)

Headline finding reproduces the paper's Table 1 trend; AttnRes is consistently below baseline at every milestone under matched hyperparameters and architecture:

Step 500 — baseline 6.1412 vs AttnRes 6.0146 (Δ −0.1265).
Step 10,000 — baseline 4.3235 vs AttnRes 4.2192 (Δ −0.1043).

Supports the paper's ~1.25× effective-compute-gain claim. RFC evidence for upstream review.

Phase 3 — novel PP cross-stage cache adapter for Interleaved1F1B

The central engineering question of porting block-level attention into pipeline parallelism: AttnRes aggregation at block boundaries needs the previous block's outputs, but in Interleaved1F1B those outputs live on a different rank and have already moved out of the PP buffer. I designed a cross-stage cache adapter that materializes block-boundary outputs into a per-rank cache, replays them at the right microbatch index, and gates eviction against the schedule's commit semantics. Validation on PP=4, V=2:

Naive vs. adapter loss delta within the nondeterminism band (max ~0.06).
Memory overhead: +260 MB cache on stage-boundary ranks, zero on internal ranks.
41/41 CPU unit tests passing; smoke-green on real cluster runs.

Phase 4 — Kimi Linear (KDA + MLA + MoE) port + AttnRes + PP

Full integration of Key-Derived Attention (KDA), Multi-head Latent Attention (MLA), and an MoE backbone underneath the AttnRes block scheme; 436M FSDP architecture-validation checkpoint at 12.5k steps on C4-equivalent data (explicitly framed as architecture-validation grade, ~0.35% of the paper's 119B-token budget); a canonical aligned 447M Kimi Linear AttnRes pretrain under FSDP=8.

Phase 5 / 6 — Multimodal scaffolding

A LLaVA-style projector on top of the Kimi Linear AttnRes base; FSDP=8 multimodal launcher; 447M aligned multimodal continued-pretrain (2,500 steps) + SFT (500 steps) reusing LLaVA-Pretrain captions (overnight orchestrator simplified to drop COCO since SFT shares the same caption pool); multimodal GRPO entry point with a LLaVA caption task as the bridge into Phase 11.

Phase 11 — SGLang inference fork

Companion fork QIU023/sglang@attention_residual_inference: CUDA Graph integration for Kimi-Linear and Qwen3 (with cuda-graph compat fix for static cleanup), real-checkpoint end-to-end inference, chunked-prefill, and a fused Phase-2 Triton kernel for AttnRes block aggregation. Long-context AttnRes bench includes a 24K-context data point. The profiling report surfaced (and fixed) a v2-bench bug where the Triton kernel was never firing — the kind of silent-failure regression that judge-only benchmarking misses.

Phase 11 / RLHF — production GRPO loop

End-to-end GRPO/PPO via SGLang generation, scaling up production runs through the fleet:

sum-digits production runs at 50 / 200 / 500 steps demonstrating real learning under kl_coef=0.05 with a frozen reference model.
1,000-step definitive run capturing 63,483 NCCL ops; 500-step run captured 51K.
PolicyTrainer relaxation, lead/follower + provisioner shared-mode topology, Monarch worker spawn fix (pre-import torch + torchtitan in the worker before Monarch dispatches), NCCL trace wrapper for fabric-level tracing, varlen_attn SDPA fallback unblock.

Phase 12 — Temporal AttnRes for long-video

Extending Block Attention Residuals along the temporal axis: a long-video cross-frame compression ablation and a Temporal AttnRes proposal that applies the same softmax-over-block-outputs idea to cross-frame aggregation, plus a video / BEV perception extension plan. The natural bridge between the ML-systems work here and the multimodal VLA work on Story I.

Upstream contributions

Beyond RFC pytorch/torchtitan#3029, the work surfaced and filed a batch of upstream fixes — including three SGLang branches and a flash-linear-attention fused_norm_gate patch for an sm_120 (RTX 5090) KDA crash — discovered while bringing Kimi-Linear up on current hardware.

Cross-cutting infra in the fork

Multi-stage pipeline parallelism (PP=4 with the cross-stage cache adapter).
AttnRes subclass with a scaling-law config registry covering the model-size sweep.
Silent-grad-loss detection with clone-on-capture — guards against the activation / gradient aliasing bugs that pipeline-parallel debugging is full of.
grouped_mm + torch.compile throughput path.
git-LFS-backed CSV.gz NCCL traces; vast.ai env-compat patch for cross-host reproducibility.

Chronology

Work Experience

Positions held, most-recent first

Amazon Web Services — Software Engineer

Apr 2024 – Present · Santa Clara, CA

AWS DC Network Infra — Scalable Intent-Driven Routing (SIDR) · Rust, Python, Tokio

Designed and built an end-to-end automated release qualification framework for SIDR with hierarchical concurrent workflow engines, comprehensive protocol state tracing, and chaos fault injection (partitions, device failures, interface flaps) to validate consensus convergence and MPC operation correctness with rollback under extreme stress and chaotic network / software failure conditions. Integrated into CI/CD pipeline, achieving 15× speedup in release qualification and qualifying all subsequent SIDR production releases.
Delivered SIDR daemon logic for network cross-protocol route redistribution, including inter-process communication, message-stream parsing, multi-module asynchronous programming and OS-signaling, and optimized message generation and MPC lazy-init in multiple SIDR modules for improved efficiency.
Delivered SIDR protocol security enhancement through message authentication and verification mechanisms, with thorough design and implementation for service-fabric workflows, race-condition / failures handling, and comprehensive end-to-end integration tests spanning controller-to-daemon intent distribution, system-level signaling, and protocol state-machine transitions under various network and system conditions.

XPeng Motors — Computer Vision Engineer Intern

Oct 2023 – Mar 2024 · San Diego, CA

Autonomous Driving Center · Python, PyTorch, DALI, ONNX

Training Pipeline Acceleration: Integrated NVIDIA DALI for GPU-based online augmentation on huge-scale image datasets, offloading preprocessing from CPU to GPU via multi-process pipelines — 7× training speedup, 80% CPU reduction.
Multi-task Backbone Consolidation: Merged task-specific perception models into a unified shared backbone; systematically explored FLOPs / cross-task generalization trade-offs to cut on-car scheduling and memory overhead while preserving per-task accuracy.
Eye-Action Video Classification for DMS: Owned the end-to-end pipeline (dataset, temporal model, in-vehicle validation) — 99.64% accuracy, 30% latency reduction under real-time on-car constraints.
Simulation-Driven Long-tail Object Detection: Replenished an OD dataset with photorealistic simulation for rare categories; validated via YOLO-X showing consistent mAP gains. Co-author of Anything in Any Scene.

ByteDance — Video Algorithms Engineer Intern

Nov 2021 – Apr 2022 · Shenzhen, China

Real-Time Communication, Video Group · Python, PyTorch

Real-time Multi-frame Super-Resolution for TikTok live-streaming RTC: novel low-level encoder/decoder auxiliary modules leveraging temporal consistency and residual-aware fusion — 43% PSNR gain improvement, shipped.
Robust Facial Landmark Detection for ROI-aware bitrate allocation: facial-parsing preprocessing, weighted loss + balanced resampling for long-tail poses, global-context branch — 67% NME reduction; unstructured pruning further -20% inference time.
Built FFmpeg-based offline augmentation simulating codec artifacts and a multi-threaded concurrent I/O queue hiding I/O behind GPU compute — 40% training time reduction.

Amazon — SDE Intern, AWS CloudFront

Jun 2023 – Sep 2023 · Seattle, WA

CloudFront Function (CF2) Tagging in Control Plane · Java, Kotlin, AWS

In a micro-service / distributed-transaction setting, designed unique-ID based tagging, analyzed race conditions and concurrent failures, and handled them with synchronous DB erasing calls, cleaner threads, and DynamoDB distributed locks.
Optimized 3 customer CF2 APIs — 25% latency reduction by eliminating redundant RPC round-trips; shortened tagging cleaner lists for 98.5% RPS reduction and 60× cleaning speed.
Extended AWS CloudFormation interface for CF2 tagging with async process + callbacks + Factory pattern; comprehensive integration tests covering concurrent race-conditions.

DMAI — Computer Vision Engineer Intern

Jul 2021 – Oct 2021 · Guangzhou, China

DMAI Research Center · Python, PyTorch

AILA Preschool Learning System card recognition: benchmarked / optimized lightweight detectors (RFB, YOLO-X, YOLO-v5) with augmentation search and loss tuning — 99.5% mAP; open-set loss resolved 95% edge-case failures at 99% precision.

Publications

Research

7 publications · 115 citations

My research spans semantic segmentation, knowledge-distillation-based continual learning, classification, out-of-distribution detection, and transfer learning. Core methodological theme: what a visual encoder produces is only half the information — how it attends, relates, and abstracts is equally transferable and equally worth distilling.

Selected top-cited contributions

SATS: Self-Attention Transfer for Continual Semantic Segmentation. 1st author. Pattern Recognition, 2023 · 55 citations. Continual semantic segmentation suffers from catastrophic forgetting when a model is incrementally trained on new classes while retaining old ones. SATS introduces a lightweight self-attention transfer scheme that distills the inter-patch relationship structure from the old model's self-attention maps to the new model during incremental steps — transferring how the visual encoder attends between patches, not just what features it produces. The technique is a plug-in for any vision-transformer-based backbone, sets state of the art on standard continual-segmentation benchmarks, and is the method I'm now extending from CNN-era continual segmentation into modern VLM distillation.
Classifier-head Informed Feature Masking and Prototype-based Logit Smoothing for Out-of-Distribution Detection. 2nd author. IEEE TCSVT, 2024 · 23 citations. Post-hoc OOD detection combining classifier-head-informed feature masking (classifier-head weights mask activations not tied to any in-distribution class) with prototype-based logit smoothing (class prototypes regularize logits for off-manifold samples).
Topic Driven Adaptive Network for Cross-Domain Sentiment Classification (TDAN). 2nd author. Information Processing and Management, 2023 · 23 citations. Cross-domain sentiment classification via a neural topic model plus a keyword branch, fused by cross dot-product attention between topic and keyword representations — grounding predictions in keyword-occurrence signals that transfer across source and target domains.

Other publications

Anything in Any Scene: Photorealistic Video Object Insertion

Co-author · Preprint, arXiv:2401.17509, 2024 · 300+ GitHub stars · 13 citations · Presented at EI 2025 Highlights Session (IS&T Electronic Imaging)

Deep Model Reference: Simple Yet Effective Confidence Estimation for Image Classification

2nd author · MICCAI, 2024

Class Incremental Learning with Task-Specific Batch Normalization and Out-of-Distribution Detection

3rd author · Neurocomputing, 2026 · 1 citation · ScienceDirect ↗

Local Background Features Matter in Out-of-Distribution Detection

3rd author · Under review at Neural Computation

Full list on Google Scholar →