|
Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing
Xin Sun,
Xuan Wang,
Qiong Liu,
Xi Zhou,
paper
in IEEE Signal Processing Letters (SPL), 2024. (CCF-C)
Details
In this letter, we observe that previous studies often overlook the global context within video events.
To alleviate this, we create a two-dimensional map to generate multi-scale event proposals for both audio and visual modalities. Subsequently, we fuse audio and visual signals at both segment and event levels with a novel boundary-aware feature aggregation method, enabling the simultaneous capture of local and global information. To enhance the temporal alignment between the two modalities, we employ segment-level and event-level contrastive learning.
Our experiments consistently demonstrate the effectiveness of our approach.
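A minimal sketch of the two-dimensional proposal map idea (shapes and the mean-pooling choice are illustrative assumptions, not the exact module in the letter): entry (i, j) aggregates the segment features from segment i to segment j, yielding multi-scale event-level candidates for one modality.
```python
# Illustrative sketch: a 2D map whose entry (i, j) pools segment features
# from segment i to j, giving multi-scale event proposals for one modality.
import torch

def build_2d_event_map(seg_feats: torch.Tensor) -> torch.Tensor:
    """seg_feats: (T, D) segment-level features -> (T, T, D) event map.
    Entry (i, j) with i <= j is the mean of segments i..j; others stay zero."""
    T, D = seg_feats.shape
    event_map = torch.zeros(T, T, D)
    cumsum = torch.cat([torch.zeros(1, D), seg_feats.cumsum(dim=0)], dim=0)
    for i in range(T):
        for j in range(i, T):
            event_map[i, j] = (cumsum[j + 1] - cumsum[i]) / (j - i + 1)
    return event_map

# Example: 10 audio segments with 256-d features (the same map can be built
# for the visual stream).
audio_map = build_2d_event_map(torch.randn(10, 256))
print(audio_map.shape)  # torch.Size([10, 10, 256])
```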
|
|
High-Compressed Deepfake Video Detection with Contrastive Spatiotemporal Distillation
Yizhe Zhu,
Chunhui Zhang,
Jialin Gao,
Xin Sun,
Zihan Rui,
Xi Zhou,
paper
in Neurocomputing, 2023. (CCF-C)
Details
We propose a Contrastive SpatioTemporal Distilling (CSTD) approach that leverages spatial-frequency cues and temporal-contrastive alignment to improve high-compressed deepfake video detection. Our approach employs a two-stage spatiotemporal video encoder to fully exploit spatiotemporal inconsistency information. A fine-grained spatial-frequency distillation module is used to retrieve invariant forgery cues in spatial and frequency domains.
Additionally, a mutual-information temporal-contrastive distillation module is introduced to enhance the temporal correlated information and transfer the temporal structural knowledge from the teacher model to the student model.
We demonstrate the effectiveness and robustness of our method on high-compressed, low-quality deepfake videos from public benchmarks.
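An illustrative sketch of the two distillation signals (the loss forms and names here are assumptions, not the paper's exact formulation): a frequency-domain feature loss for the spatial-frequency cues, and an InfoNCE-style temporal contrastive loss that matches teacher and student frames at the same timestep.
```python
# Sketch only: frequency-domain feature distillation + temporal contrastive
# distillation between a teacher and a high-compression student.
import torch
import torch.nn.functional as F

def spatial_frequency_loss(f_student, f_teacher):
    """f_*: (B, C, H, W) feature maps; compare their amplitude spectra."""
    amp_s = torch.fft.rfft2(f_student, norm="ortho").abs()
    amp_t = torch.fft.rfft2(f_teacher, norm="ortho").abs()
    return F.mse_loss(amp_s, amp_t)

def temporal_contrastive_loss(z_student, z_teacher, tau=0.07):
    """z_*: (T, D) per-frame embeddings; frames at the same timestep are positives."""
    z_s = F.normalize(z_student, dim=-1)
    z_t = F.normalize(z_teacher, dim=-1)
    logits = z_s @ z_t.t() / tau              # (T, T) similarity matrix
    targets = torch.arange(z_s.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, targets)

loss = (spatial_frequency_loss(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))
        + temporal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128)))
```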
|
|
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
Chunhui Zhang*,
Xin Sun*,
Li Liu,
Yiqian Yang,
Qiong Liu,
Xi Zhou,
Yanfeng Wang,
(* means equal contribution)
paper
/
code
in ACM International Conference on Multimedia (ACM MM), 2023. (CCF-A)
Details
We present a simple, compact and effective one-stream framework for VL tracking, namely All-in-One, which learns
VL representations from raw visual and language signals end-to-end in a unified transformer backbone.
The core insight is to establish bidirectional information flow between well-aligned visual and language signals as early as possible.
We also develop a novel multi-modal alignment module incorporating cross-modal and intra-modal alignments to learn more reasonable VL representations.
Extensive experiments on multiple VL tracking benchmarks have demonstrated the effectiveness and generalization of our approach.
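A minimal sketch of the one-stream idea (layer sizes and the token layout are illustrative assumptions): language, template, and search-region tokens are concatenated and processed jointly by a single shared transformer, so cross-modal information flows in both directions from the first layer.
```python
# Sketch: early fusion of language and visual tokens in one shared encoder.
import torch
import torch.nn as nn

class OneStreamFusion(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.type_embed = nn.Embedding(3, dim)  # 0: language, 1: template, 2: search

    def forward(self, lang_tok, tmpl_tok, search_tok):
        # lang_tok: (B, L, D), tmpl_tok: (B, Nt, D), search_tok: (B, Ns, D)
        tokens = torch.cat([lang_tok, tmpl_tok, search_tok], dim=1)
        types = torch.cat([
            torch.full((lang_tok.size(1),), 0, dtype=torch.long),
            torch.full((tmpl_tok.size(1),), 1, dtype=torch.long),
            torch.full((search_tok.size(1),), 2, dtype=torch.long),
        ]).to(tokens.device)
        tokens = tokens + self.type_embed(types)
        fused = self.encoder(tokens)
        # Return the fused search-region tokens for a downstream tracking head.
        return fused[:, -search_tok.size(1):]

out = OneStreamFusion()(torch.randn(2, 12, 256), torch.randn(2, 64, 256), torch.randn(2, 256, 256))
```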
|
|
Exploiting Multi-modal Fusion for Robust Face Representation Learning with Missing Modality
Yizhe Zhu,
Xin Sun,
Xi Zhou,
paper
in International Conference on Artificial Neural Networks (ICANN), 2023. (CCF-C)
Details
We propose a multi-modal fusion framework that addresses the problem of uncertain missing modalities in face recognition.
Specifically, we first introduce a novel modality-missing loss function based on triplet hard loss to learn individual features for RGB, depth, and thermal modalities.
We then use a central moment discrepancy (CMD) based distance constraint training strategy to learn joint modality-invariant representations.
This approach fully leverages the characteristics of heterogeneous modalities to mitigate the modality gap, resulting in robust multi-modal joint representations.
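A small sketch of the central moment discrepancy (CMD) distance used as the alignment constraint (the number of moments K is an illustrative choice): it matches the means and the first K central moments of two modality feature distributions.
```python
# Sketch: CMD distance between feature batches from two modalities.
import torch

def cmd(x: torch.Tensor, y: torch.Tensor, k: int = 5) -> torch.Tensor:
    """x, y: (N, D) feature batches from two modalities (e.g. RGB vs. depth)."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = torch.norm(mx - my, p=2)                      # match the means
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):                        # match higher central moments
        loss = loss + torch.norm(cx.pow(order).mean(dim=0) - cy.pow(order).mean(dim=0), p=2)
    return loss

rgb_feats, depth_feats = torch.randn(32, 128), torch.randn(32, 128)
print(cmd(rgb_feats, depth_feats))
```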
|
|
Video Moment Retrieval via Comprehensive Relation-aware Network
Xin Sun,
Jialin Gao,
Yizhe Zhu,
Xuan Wang,
Xi Zhou,
paper
in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2023. (CCF-B)
Details
This manuscript is a substantial extension of our RaNet (EMNLP 2021) with several improvements, namely a background suppression module, clip-level interaction, and an IoU attention mechanism.
The background suppression module and IoU attention mechanism work together to enhance the hierarchical relations within the model by respectively modulating clip-level and moment-level features.
The clip-level interaction provides a complementary perspective by capturing localized visual information, resulting in a multi-granular perception of inter-modality information.
As a result, the harmonious integration of these modules significantly improves the model's ability to model comprehensive relations.
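A hedged illustration of one possible reading of the IoU attention mechanism (not necessarily the paper's exact design): the temporal IoU between candidate moments biases how strongly each moment attends to the others.
```python
# Sketch: IoU-biased attention over candidate moment features.
import torch
import torch.nn.functional as F

def temporal_iou(moments: torch.Tensor) -> torch.Tensor:
    """moments: (M, 2) [start, end] pairs -> (M, M) pairwise temporal IoU."""
    s, e = moments[:, 0], moments[:, 1]
    inter = (torch.minimum(e[:, None], e[None, :]) - torch.maximum(s[:, None], s[None, :])).clamp(min=0)
    union = (e[:, None] - s[:, None]) + (e[None, :] - s[None, :]) - inter
    return inter / union.clamp(min=1e-6)

def iou_attention(feats: torch.Tensor, moments: torch.Tensor) -> torch.Tensor:
    """feats: (M, D) moment features; the IoU matrix biases the attention logits."""
    scores = feats @ feats.t() / feats.size(-1) ** 0.5
    weights = F.softmax(scores + temporal_iou(moments), dim=-1)
    return weights @ feats

moments = torch.tensor([[0.0, 2.0], [1.0, 3.0], [4.0, 6.0]])
out = iou_attention(torch.randn(3, 64), moments)
```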
|
|
Efficient Video Grounding with Which-Where Reading Comprehension
Jialin Gao,
Xin Sun,
Bernard Ghanem,
Xi Zhou,
Shiming Ge,
paper
in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022. (CCF-B)
Details
We present an efficient framework that facilitates video grounding in a “from which to where” fashion. The core idea is to imitate the reading comprehension process and gradually narrow the decision space. The “which” step first roughly selects a candidate area by evaluating which video segment in the predefined set is closest to the ground truth, while the “where” step precisely regresses the temporal boundary of the selected segment within the shrunk decision space. Extensive experiments demonstrate the effectiveness of our framework.
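A compact sketch of the “from which to where” reading (the heads and shapes are assumed for illustration): a “which” head scores the predefined candidate segments, and a “where” head regresses boundary offsets only for the selected candidate.
```python
# Sketch: two-step grounding head -- select a candidate, then refine its boundary.
import torch
import torch.nn as nn

class WhichWhereHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.which = nn.Linear(dim, 1)   # relevance score per candidate segment
        self.where = nn.Linear(dim, 2)   # (start, end) offsets for refinement

    def forward(self, cand_feats: torch.Tensor, cand_bounds: torch.Tensor):
        """cand_feats: (N, D) features of N predefined candidate segments;
        cand_bounds: (N, 2) their [start, end] boundaries (in seconds)."""
        scores = self.which(cand_feats).squeeze(-1)   # "which": pick a candidate area
        best = scores.argmax()
        offsets = self.where(cand_feats[best])        # "where": refine its boundary
        return cand_bounds[best] + offsets, scores

pred, scores = WhichWhereHead()(torch.randn(16, 256), torch.rand(16, 2).cumsum(-1))
```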
|
|
You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos
Xin Sun,
Xuan Wang,
Jialin Gao,
Qiong Liu,
Xi Zhou,
paper
/
code
in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022. (CCF-A)
Details
We formulate the moment retrieval task from the perspective of multi-choice reading comprehension and propose a novel
Multi-Granularity Perception Network (MGPN) to tackle it. We integrate several human reading strategies
(i.e., passage-question rereading, enhanced passage-question alignment, and choice comparison) into our framework
and empower our model to perceive intra-modality and inter-modality information at multiple granularities.
Extensive experiments demonstrate the effectiveness and efficiency of our proposed MGPN.
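A hedged illustration of the passage-question reread strategy (one plausible realization, not the exact module in MGPN): the video (“passage”) re-attends to the query (“question”) in a second cross-attention pass after an initial coarse alignment.
```python
# Sketch: a two-pass "reread" of the query via stacked cross-attention.
import torch
import torch.nn as nn

class RereadBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.first_read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reread = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        """video: (B, T, D) clip features, query: (B, L, D) word features."""
        coarse, _ = self.first_read(video, query, query)          # first pass
        refined, _ = self.reread(video + coarse, query, query)    # reread the question
        return video + refined

out = RereadBlock()(torch.randn(2, 32, 256), torch.randn(2, 10, 256))
```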
|
|
Relation-aware Video Reading Comprehension for Temporal Language Grounding
Jialin Gao*,
Xin Sun*,
MengMeng Xu,
Xi Zhou,
Bernard Ghanem,
(* means equal contribution)
paper
/
code
in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. (CCF-B)
Details
We propose a novel Relation-aware Network (RaNet)
to address the problem of temporal language grounding in videos. The visual and textual modalities interact
in a coarse-and-fine fashion to obtain token-aware and sentence-aware representations of each choice.
Further, a graph attention (GAT) layer is introduced to mine the relations among the multiple choices for better ranking.
Our model is efficient and outperforms state-of-the-art methods.
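A small sketch of the relation-mining step (a generic graph attention layer over a fully connected graph of candidates; the exact GAT configuration in RaNet may differ): each choice aggregates information from the other choices before ranking.
```python
# Sketch: graph attention over moment candidates ("choices") for relation mining.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChoiceGAT(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, choices: torch.Tensor) -> torch.Tensor:
        """choices: (M, D) features of M candidate moments."""
        h = self.proj(choices)                                          # (M, D)
        M = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(-1, M, -1),
                           h.unsqueeze(0).expand(M, -1, -1)], dim=-1)   # (M, M, 2D)
        alpha = F.softmax(F.leaky_relu(self.attn(pairs)).squeeze(-1), dim=-1)
        return choices + alpha @ h                                      # relation-aware choices

out = ChoiceGAT()(torch.randn(20, 256))
```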
|
-
Conference Reviewer: ACM MM'23
-
Journal Reviewer: TCSVT
|