Special Sessions

Time

TBA

Room

TBA

Organizers

  • Liu Zhenguang, Zhejiang University, China
  • Ji Zhang, University of Southern Queensland, Australia
  • Hao Huang, Wuhan University, China

Abstract

To intelligently interact with humans, artificial intelligence is required to have a sound understanding of human videos and images. In daily life and working scenes, intelligent systems are designed to analyse humans' interactions with their surrounding environments. Typical visual understanding tasks include human pose estimation, pedestrian tracking, action recognition, and motion prediction. For the individual human body, developing intelligent machines for physiological monitoring, medical image analysis, and other health-care tasks is an emerging research direction.

This special session seeks innovative papers that exploit novel technologies and solutions from both industry and academia on highly effective and efficient human-centered intelligent multimedia understanding. The list of possible topics includes, but is not limited to:

  • human action recognition and motion prediction
  • human pose estimation
  • medical image analysis
  • knowledge-driven video understanding
  • causality-driven cross-media analysis
  • cross-modal knowledge analysis

Time

TBA

Room

TBA

Organizers

  • Zhedong Zheng, National University of Singapore
  • Linchao Zhu, University of Technology Sydney
  • Liang Zheng, Australian National University
  • Yi Yang, Zhejiang University
  • Tat-Seng Chua, National University of Singapore

Abstract

In this special session, we aim to bring together the latest advances in Responsible, Responsive, and Robust (3R) multimedia retrieval, and draw attention from both the academic and industrial communities. A Responsible system aims to protect users’ privacy and related rights; a Responsive system focuses on providing efficient and effective feedback on million-scale input data; and a Robust system ensures the reliability and reproducibility of predictions and avoids unnecessary fatal errors.

Responsible: User data must be used in beneficial contexts and its privacy respected, especially biometric data such as facial images. There are two potential solutions to protect user privacy. One is to leverage data generated by GANs or synthesized by 3D engines for model training, so that the model does not require access to real user data. Another is to harness Federated Learning, which places fewer demands on users’ private data. Both approaches are still under-explored. Through this special session, we hope to provide an avenue for the community to discuss these developments and draw attention to responsible multimedia retrieval systems.
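
As a hedged illustration of the Federated Learning route mentioned above, the sketch below shows the core aggregation step of FedAvg, one common federated scheme (not necessarily the one any particular submission would adopt). The client parameters, dataset sizes, and parameter names are purely hypothetical.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Weighted average of client model parameters (FedAvg-style aggregation).

    client_params: list of dicts mapping parameter name -> np.ndarray
    client_sizes:  list of local dataset sizes, used as aggregation weights
    """
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]
    aggregated = {}
    for name in client_params[0]:
        aggregated[name] = sum(w * p[name] for w, p in zip(weights, client_params))
    return aggregated

# Hypothetical example: two clients, each holding a tiny linear model.
clients = [
    {"W": np.array([[0.2, 0.1]]), "b": np.array([0.0])},
    {"W": np.array([[0.4, 0.3]]), "b": np.array([0.2])},
]
global_model = fedavg(clients, client_sizes=[100, 300])
print(global_model["W"], global_model["b"])
```

The raw data never leaves the clients; only the locally updated parameters are shared and averaged.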

Responsive: The increasing quantity of multimedia data also demands an efficient retrieval system that can handle million-scale input data within interactive response times. It remains unclear whether the representations learned by CNNs, RNNs, and transformers are compatible with traditional hashing approaches or other dimensionality reduction methods such as PCA. On the other hand, there is also the scientific question of efficient model design; related techniques include automated machine learning (AutoML), model pruning for CNNs, RNNs, and transformers, and prompt-based learning to reduce training or testing time.
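
As one hedged illustration of the hashing question raised above, the minimal sketch below applies random-hyperplane hashing (a simple LSH variant) to pre-extracted deep features and retrieves neighbors by Hamming distance. The feature dimensions and database size are arbitrary placeholders; real systems would use learned hash functions and far larger indexes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "deep" features: 10,000 database items and 1 query, 512-D each.
db_feats = rng.standard_normal((10_000, 512)).astype(np.float32)
query = rng.standard_normal((1, 512)).astype(np.float32)

# Random-hyperplane hashing: project onto 64 random directions, keep the signs.
hyperplanes = rng.standard_normal((512, 64)).astype(np.float32)
db_codes = (db_feats @ hyperplanes) > 0          # (10000, 64) boolean codes
query_code = (query @ hyperplanes) > 0           # (1, 64)

# Retrieval: rank database items by Hamming distance to the query code.
hamming = np.count_nonzero(db_codes != query_code, axis=1)
top10 = np.argsort(hamming)[:10]
print("closest items:", top10, "distances:", hamming[top10])
```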

Robust: It remains a great challenge to train systems to deal with out-of-distribution data. Two recent cases show that systems trained via blind data-driven approaches may produce unexpected and undesirable results. In 2015, a commercial photo system labeled African Americans as gorillas, raising great concerns about racial discrimination and bias in such human recognition systems. Since the system was blindly trained on data and hard to tune, a quick fix to alleviate model biases at the time was to remove the gorilla class from the system. Similarly, in 2018, a self-driving car hit a pedestrian after misclassifying the pedestrian with her bike, a case that may not have been seen by the system during training. If both systems could consider uncertainty and learn more invariant causal factors, such accidents could be avoided. Therefore, we also want to promote discussions on viewpoint-invariant representation learning, domain adaptation, and long-tailed recognition for multimedia retrieval.

The list of possible topics includes, but is not limited to:

  • Responsible:
    • Federated Learning for Multimedia Applications
    • Synthetic Data / Generated Data for Representation Learning
    • Interpretable Multimedia Learning and Visualization Techniques
    • Human-Centered Multimedia Analysis
  • Responsive:
    • Model Compression / AutoML for Multimedia Retrieval
    • New Hashing Methods for Representation Learning
    • Prompt Learning for Multi-Domain Adaptation
    • Multimedia Computing on Edge Devices
  • Robust: New Evaluation Metrics and Benchmarks

Time

TBA

Room

TBA

Organizers

  • Xuemeng Song, Shandong University
  • Meng Liu, Shandong Jianzhu University
  • Yinwei Wei, National University of Singapore
  • Xiaojun Chang, RMIT University, Australia

Abstract

Thanks to the flourishing of multimedia devices (e.g., smart mobile devices), recent years have witnessed unprecedented growth of multimedia data in people’s daily lives. For example, vast amounts of multimedia data have emerged on the Internet and accumulated in the video surveillance domain. Therefore, multimedia search, which aims to help users find target multimedia data in huge databases, has gained increasing research attention. Early studies mainly allowed users to issue simple keyword-based or image-based queries. However, in the real world, the user’s intent may be rather complex and can hardly be expressed by a few keywords or images. Accordingly, recent research efforts have turned to multimedia search that enables more sophisticated queries, such as a micro facial expression, a long descriptive sentence, a modification sentence plus a reference image, or even a multi-modal dialog. Although existing methods have achieved compelling progress, they still perform far from satisfactorily due to the challenge of user intent understanding.

The goal of this special session is to call for a coordinated effort to promote user intent understanding towards sophisticated multimedia search, showcase innovative methodologies and ideas, introduce large-scale real systems or applications, as well as propose new real-world datasets and discuss future directions. We solicit manuscripts in all fields that shed light on user intent understanding towards sophisticated multimedia search.

We believe the special session will offer a timely collection of research updates to benefit researchers and practitioners working in broad fields ranging from information retrieval and multimedia to machine learning. To this end, we solicit original research and survey papers addressing the topics listed below (but not limited to):

  • User intent understanding towards sentence-based target moment localization
  • User intent understanding towards user feedback-guided image/video retrieval
  • User intent understanding towards multi-modal dialog systems
  • User intent understanding towards interactive search scenarios
  • Data analytics and demo systems for multimedia search
  • Multimedia search related datasets
  • Emerging new applications belonging to multimedia search

Time

TBA

Room

TBA

Organizers

  • Yang Yang, University of Electronic Science and Technology of China
  • Guoqing Wang, University of Electronic Science and Technology of China
  • Yi Bin, University of Electronic Science and Technology of China

Abstract

Understanding the world around us involves processing information of multiple modalities: we see objects with our eyes, feel texture with our hands, communicate through language, and so on. In order for more general Artificial Intelligence (AI) to understand the world better, it needs to be able to interpret and reason about multimodal messages. Representative research includes connecting vision and language for better media content understanding (e.g., cross-modal retrieval, image and video captioning, visual question answering, etc.) and fusing different physical sensors (e.g., camera, LiDAR, tactile sensing, audio, etc.) for better perception of the world by intelligent agents such as self-driving cars and robots. Despite a series of papers proposing novel designs of multimodal understanding frameworks at top conferences (e.g., CVPR, ICCV, ECCV, NeurIPS and ICLR) and in top journals (e.g., IEEE TPAMI and IJCV), and the emergence of appealing applications that leverage multimodal sensors for more complete and accurate perception, some core challenges remain unsolved, including: how to learn computer-interpretable descriptions of heterogeneous data from multiple modalities, how to represent the process of translating data from one modality to another, how to identify relations between elements from two or more different modalities, how to fuse information from two or more modalities to perform a prediction task, and finally how to transfer knowledge between modalities and their latent representations.

The goal of this special session is to encourage researchers to present high-quality work and to facilitate effective discussions on potential solutions to these challenges. To this end, this special session solicits scientific and technical contributions regarding recent findings in theory, methodologies, and applications within the topic of multimodal intelligence. Position papers with feasibility studies and cross-modality issues with a highly applicative flair are also encouraged; we therefore expect a positive response from both academic and industrial communities.

Potential Topics Include (but are not limited to):

  • Multimodal learning
  • Cross-modal learning
  • X-supervised learning for multimodal data
  • Multimodal data generation and sensors
  • Cross-modal adaptation
  • Embodied multimodal learning
  • Multimodal transfer learning
  • Multimodal applications (e.g., autonomous driving, robotics, etc.)
  • Machine Learning studies of unusual modalities

Time

TBA

Room

TBA

Organizers

  • Shuyuan Zhu, University of Electronic Science and Technology of China
  • Zhan Ma, Nanjing University
  • Wei Hu, Peking University
  • Giuseppe Valenzise, Centre National de la Recherche Scientifique (CNRS), France

Abstract

Recent years have witnessed the emergence of point clouds as one of the most promising formats to realistically represent 3D objects and scenes, using a set of unstructured points sparsely distributed in 3D space. As with 2D images or videos, efficient point cloud processing, from acquisition to understanding, is of great interest for enabling applications such as augmented reality, autonomous driving, and the metaverse.

Different from well-structured pixels in 2D images, unstructured and irregularly sampled 3D points can efficiently render 3D objects, but they require additional effort to find effective representations for subsequent computation. For example, leveraging neighborhood correlations is difficult for loosely connected 3D points, and finding discriminative features is also a challenging problem for point cloud understanding (a small neighborhood-grouping sketch follows the topic list below). Thus, this special session solicits novel ideas and contributions for point cloud acquisition, processing and understanding, including but not limited to:

  • Point cloud acquisition and rendering
  • Point cloud compression and communication
  • Point cloud understanding and vision tasks
  • Point cloud quality evaluation and modeling
  • Point cloud standards
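
As referenced above, here is a minimal, hedged sketch of how neighborhood structure is commonly recovered for unstructured points: brute-force k-nearest-neighbor grouping in NumPy. The point count and k are arbitrary, and practical pipelines would rely on spatial indexes (e.g., KD-trees or voxel grids) rather than an O(N^2) distance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(2048, 3)).astype(np.float32)  # toy point cloud

def knn_groups(pts, k=16):
    """Return, for every point, the indices of its k nearest neighbors."""
    # Pairwise squared Euclidean distances; O(N^2) memory, fine for toy sizes only.
    diff = pts[:, None, :] - pts[None, :, :]
    dist2 = np.sum(diff * diff, axis=-1)
    np.fill_diagonal(dist2, np.inf)              # exclude each point itself
    return np.argsort(dist2, axis=1)[:, :k]      # (N, k) neighbor indices

neighbors = knn_groups(points, k=16)
# Simple local geometric feature: mean offset of each neighborhood from its center.
offsets = points[neighbors] - points[:, None, :]  # (N, k, 3)
print(neighbors.shape, offsets.mean(axis=1).shape)
```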

Time

TBA

Room

TBA

Organizers

  • Yi Jin, Beijing Jiaotong University
  • Shaohua Wan, Zhongnan University of Economics and Law
  • Zan Gao, Qilu University of Technology
  • Michele Nappi, University of Salerno
  • Yu-Dong Zhang, University of Leicester

Abstract

Visual Question Answering (VQA) is a recent hot topic that involves multimedia analysis, computer vision (CV), natural language processing (NLP), and, more broadly, artificial intelligence, and it has attracted significant interest from the machine learning, CV, and NLP communities. A VQA system takes as input an image and a free-form, open-ended natural-language question about the image, and generates a natural-language answer as output. Such functionality can be useful in a wide range of applications, such as analysis of surveillance images or video footage from camera networks in smart-city environments, and semantic analysis of large photographic archives.

When we want a machine to answer a specific question about an image in natural language, we need to encode information about the content of the image and the meaning and intention of the question in a form that the machine can use to provide a reasonable answer. VQA relates to the broad context of AI technologies in multiple aspects: fine-grained recognition, object detection, behavior understanding, semantic scene analysis, and understanding of the semantics of free text. Because VQA combines notions from CV and NLP, a natural VQA solution can be obtained by integrating methodologies from these two subfields of AI. With recent developments in deep learning, a better understanding of the high-level and fine-grained semantics of visual content becomes possible. This special session aims to provide a forum for scientists and engineers working in academia, industry and government to present their latest research findings and engineering experiences on VQA for Multimedia.
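
To make the encode-and-fuse idea above concrete, here is a hedged, minimal PyTorch-style sketch of a VQA baseline: a question encoder (embedding + LSTM), an image-feature projection, and a simple fused classifier over a fixed answer vocabulary. The dimensions, vocabulary sizes, and the assumption of pre-extracted image features are illustrative only; modern systems typically use attention- or transformer-based fusion instead.

```python
import torch
import torch.nn as nn

class TinyVQA(nn.Module):
    """Minimal VQA baseline: fuse an image feature with an LSTM question encoding."""

    def __init__(self, vocab_size=10_000, num_answers=1_000,
                 img_dim=2048, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hid_dim)
        self.classifier = nn.Linear(2 * hid_dim, num_answers)

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) pre-extracted CNN feature; question_ids: (B, T)
        _, (h_n, _) = self.lstm(self.embed(question_ids))
        q = h_n[-1]                                   # (B, hid_dim) question encoding
        v = torch.relu(self.img_proj(img_feat))       # (B, hid_dim) image encoding
        return self.classifier(torch.cat([q, v], dim=1))  # (B, num_answers) logits

# Hypothetical forward pass on random inputs.
model = TinyVQA()
logits = model(torch.randn(4, 2048), torch.randint(0, 10_000, (4, 12)))
print(logits.shape)  # torch.Size([4, 1000])
```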

Time

TBA

Room

TBA

Organizers

  • Wei Cai, The Chinese University of Hong Kong, Shenzhen
  • Yonggang Wen, Nanyang Technological University, Singapore
  • Maha Abdallah, Université Pierre & Marie Curie

Abstract

With the rapid development of blockchain-based decentralized applications and human-computer interaction (HCI) techniques, the Metaverse has recently attracted considerable attention in both industry and academia, especially after Facebook changed its name to Meta. The unprecedented focus on and investment in the Metaverse will accelerate the development of and breakthroughs in related technologies, producing a series of open multimedia research questions on the Metaverse, including user-generated multimedia content, multimedia toolkits, NFT artwork, blockchain games, and visualizations for the ecosystem. In this special session, we would like to provide a venue for researchers in this area to publish their recent discoveries and outcomes in a timely manner. Topics of interest include but are not limited to:

  • Toolkits and systems for user-generated contents in metaverse applications
  • Incentive mechanism design for user-generated contents in metaverse applications
  • Artificial intelligence for user-generated contents and other aspects of the metaverse
  • Novel human-computer interaction/interface for metaverse systems
  • Innovations in NFTs and crypto artwork in the metaverse
  • Innovations in decentralized games design for the metaverse
  • Visualization for the metaverse ecosystem
  • Social network visualization and analysis for the metaverse ecosystem
  • Emerging multimedia applications and services in the metaverse
  • High-performance interactive multimedia infrastructure solutions to support the metaverse
  • Other multimedia research topics that are closely related to Metaverse systems

Time

TBA

Room

TBA

Organizers

  • Min-Chun Hu, National Tsing Hua University, Taiwan
  • Hung-Kuo Chu, National Tsing Hua University, Taiwan

Abstract

In recent years, multimedia technology has been widely used to analyze sports data and to aid the sports training process. Professional leagues such as MLB, NBA, and NHL have introduced systems to track players, analyze the performance of each player, summarize game highlights, and even predict the possibility of match-fixing. To enhance visual entertainment, broadcast companies have introduced multi-view synthesis and augmented reality technologies to interact with the audience in real time. Moreover, with the advance of head-mounted displays (HMDs), trainers in different kinds of sports have started utilizing virtual reality technologies to improve the skills and mindset of athletes. There is no doubt that multimedia technology has become indispensable for facilitating sports training and enhancing the game experience, which also brings tremendous business opportunities to the sports industry. In this special session, we invite researchers from the domains of multimedia content analysis, sensor data analysis, virtual/augmented/mixed reality, and artificial intelligence to submit work that can be applied in sports.

Areas of interest for this special session include but are not limited to:

  • Detection and tracking technology for sports related objects (e.g. player, ball, court)
  • Sports event detection
  • Highlight summarization
  • Player/team performance prediction
  • Sensor data analysis for sports training
  • Pose and action analysis for athletes
  • Multiview synthesis technology for sports videos
  • Video codec technology for sports video broadcasting
  • Virtual/augmented/mixed reality systems for sports training
  • Simulation of realistic athletic movements

Time

TBA

Room

TBA

Organizers

  • Chih-Chung Hsu, National Cheng Kung University
  • Li-Wei Kang, National Taiwan Normal University

Abstract

Object detection has long been an important and challenging problem in computer vision, and it has been widely applied in several domains, such as autonomous driving, visual surveillance, human-machine interaction, and medical imaging. Recently, significant improvements in object detection have been achieved with the rapid development of deep learning. Deep learning is essentially beneficial in extracting high-level and complex abstractions as data representations through a hierarchical learning process. In realizing deep learning, supervised and unsupervised approaches for training deep architectures have been empirically investigated, aided by parallel computing facilities such as GPUs and CPU clusters. However, designing and training high-accuracy and low-complexity deep models for object detection is still challenging, especially for autonomous driving applications. This special session will focus on all aspects of deep learning-based object detection and tracking for autonomous driving, emphasizing network model design, learning, and compression. It aims to bring together leading researchers and practitioners working in the emerging areas of deep learning, object detection, and autonomous driving, and on related topics and applications.

Time

TBA

Room

TBA

Organizers

  • Changsheng Li, Beijing Institute of Technology
  • Yinqiang Zheng, The University of Tokyo

Abstract

Research on autonomous driving has attracted widespread attention due to the rapid development of deep neural networks in recent years, which significantly improves the accuracy of perception and prediction and can benefit the planning module as well. With the rapid development and application of autonomous driving technology, multimodal autonomous driving has become an important research area. In practice, autonomous vehicles need to operate in a variety of scenes and modes, such as highway versus urban road, lane keeping versus lane changing, and going straight versus turning at a crossroad. Thus, it is necessary to make multiple predictions conditioned on different possible modes instead of expecting a single prediction to capture all possible modes, which is unrealistic due to the inherent uncertainty of drivers’, cyclists’, and pedestrians’ goals and motions. Additionally, autonomous vehicles usually perform multiple tasks, such as object detection, semantic segmentation, tracking, prediction, and decision-making. As a single data source is not enough to satisfy all tasks and scenes simultaneously, autonomous vehicles generally adopt multi-sensor configurations to provide multi-modal data. The multi-modal characteristics of autonomous driving are also reflected in the evaluation of uncertainty in specific tasks, such as the reliability of object detection and the multimodality of trajectory prediction. How to plan safely and efficiently with multi-modal predictions is also an important topic. Although research on autonomous driving has made dramatic advances, multimodal autonomous driving still needs further effort to ensure the robustness and reliability of autonomous driving systems, which is also where the application of autonomous driving technology is heading.

This special session focuses on multimodal autonomous driving and seeks exceptional submissions that address the key and universal problems in the field of autonomous driving, especially work related to deep learning methods. Topics of interest include, but are not limited to:

  • The acquisition and labeling of multi-modal dataset
  • The fusion of multi-sensor information
  • Scenario understanding with multi-sensor input
  • Advanced methods of multi-modal perception
  • Multimodal trajectory prediction
  • Novel criteria to evaluate the multi-modal characteristics
  • Collision-free and consistent multi-agent prediction with multimodal output
  • Planning methods under multimodal prediction output

Time

TBA

Room

TBA

Organizers

  • Guangwei Gao, Nanjing University of Posts and Telecommunications, China
  • Jing Xiao, Wuhan University, China
  • Liang Liao, National Institute of Informatics, Japan
  • Junjun Jiang, Harbin Institute of Technology, China
  • Juncheng Li, The Chinese University of Hong Kong, China
  • Shin'ichi Satoh, National Institute of Informatics, Japan

Abstract

Representation learning has always been an important research area in pattern recognition. A good representation of practical data is critical to achieving satisfactory performance. Broadly speaking, such a representation can be an “intra-data representation” or an “inter-data representation”. Intra-data representation focuses on extracting or refining the raw features of a data point itself. Representative methods range from early hand-crafted feature design (e.g., SIFT, LBP, HOG), through feature extraction (e.g., PCA, LDA, LLE) and feature selection (e.g., sparsity-based and submodularity-based methods) over the past two decades, to recent deep neural networks (e.g., CNNs, RNNs). Inter-data representation characterizes the relationship between different data points or the structure carried by the dataset. For example, metric learning, kernel learning and causality reasoning investigate the spatial or temporal relationships among different examples, while subspace learning, manifold learning and clustering discover the underlying structural properties inherent in the dataset.
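
As a hedged numerical illustration of the two notions above, the snippet below computes a PCA projection (an intra-data representation of each sample) and an RBF kernel matrix (an inter-data representation of pairwise relationships) for a toy dataset; the data, dimensions, and kernel bandwidth are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))          # toy dataset: 200 samples, 64-D features

# Intra-data representation: PCA via SVD, projecting each sample to 8 dimensions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:8].T                           # (200, 8) low-dimensional codes

# Inter-data representation: RBF kernel matrix encoding pairwise similarities.
sq_norms = np.sum(Xc**2, axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * Xc @ Xc.T
K = np.exp(-sq_dists / (2 * 10.0**2))       # (200, 200) similarity structure

print(Z.shape, K.shape)
```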

The above analysis reflects that representation learning covers a wide range of research topics related to pattern recognition. On the one hand, many new representation learning algorithms are put forward every year to cater to the needs of processing and understanding various practical multimedia data. On the other hand, many problems regarding representation learning remain unsolved, especially for big and noisy data. Therefore, the objective of this special session is to provide a stage for researchers all over the world to publish their latest and original results on representation learning.

Topics include but are not limited to:

  • Metric learning and kernel learning
  • Probabilistic graphical models
  • Multi-view/Multi-modal learning
  • Applications of representation learning
  • Robust representation and coding
  • Deep learning
  • Domain transfer learning
  • Learning under low-quality media data

Time

TBA

Room

TBA

Organizers

  • Yi Cai, South China University of Technology
  • Zhenguo Yang, Guangdong University of Technology
  • Xudong Mao, Xiamen University

Abstract

Vision-and-Language research is an interesting area at the nexus of Computer Vision and Natural Language Processing, and has attracted rapidly growing attention from both communities. The general aims of this session are to provide a forum for reporting and discussing completed research that involves both language and vision, and to enable NLP and computer vision researchers to meet, exchange ideas, expertise and technology, and form new research partnerships. Research involving both language and vision computing spans a variety of disciplines and applications, and goes back a number of decades. In a recent shift, the big data era has thrown up a multitude of tasks in which vision and language are inherently linked. The explosive growth of visual and textual data, both online and in private repositories held by diverse institutions and companies, has led to urgent requirements for the search, processing and management of digital content. A variety of vision and language tasks, benchmarked over large-scale human-annotated datasets, have driven tremendous progress in joint multimodal representation learning. This session will focus on some of the recently popular tasks in this domain, such as visual captioning, visual grounding, visual question answering and reasoning, multi-modal dialogue, text-to-image generation, image-text retrieval, multi-modal knowledge graphs, and multi-modal pattern recognition. We will invite or choose from this conference the most representative papers in these areas and discuss key principles that epitomize the core challenges and opportunities in multi-modal understanding, reasoning, and generation.

Submission

Authors should prepare their manuscript according to the Guide for Authors of ICME available at Author Information and Submission Instructions.

Important Dates

Special Session paper submission deadline: December 22, 2021 [11:59 p.m. PST] (extended from December 12, 2021)

Special Session Chairs