
Multimodal Zero-Shot Activity Recognition for Process Mining of Robotic Systems

Sara Pettinari
2026-01-01

Abstract

Understanding and analyzing the behavior of robotic systems is essential to ensure their reliability, efficiency, and continuous improvement, especially as robots are increasingly deployed in complex, dynamic environments. Process mining offers a powerful approach to uncover and analyze the execution of robotic operations. However, applying process mining to robotic systems requires bridging the gap between fine-grained multimodal data and high-level activity representations. Recent advances in foundation models provide a promising solution to this challenge, as the knowledge acquired during their extensive pretraining enables them to interpret multimodal data without the need for task-specific training. In this work, we propose a novel multimodal process mining pipeline that leverages the zero-shot capabilities of foundation models to perform activity recognition from visual and auditory inputs. By transforming fine-grained multimodal data into event logs, the pipeline enables the application of process mining techniques to robotic systems. We applied our approach to the Baxter UR5 95 Objects dataset, which offers synchronized video and audio recordings of a Baxter robot manipulating objects. The fusion of activity recognition results from these complementary modalities yields an event log that more accurately represents the robot’s operations, mitigating imprecision associated with using a single modality. Our results demonstrate that foundation models effectively enable the application of process mining to robotic systems, facilitating monitoring and analysis of their behavior.
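To make the fusion step concrete: the abstract describes merging activity-recognition results from video and audio into one event log. Below is a minimal, hypothetical sketch of such a fusion, assuming each modality yields a per-segment activity label with a confidence score and that disagreements are resolved in favor of the more confident modality; the names, data shapes, and fusion rule are illustrative assumptions, not the paper's actual method.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    activity: str      # label proposed by the foundation model (zero-shot)
    confidence: float  # model's self-reported confidence in [0, 1]

def fuse(video: Prediction, audio: Prediction) -> str:
    """Agreement wins outright; otherwise keep the more confident modality's label."""
    if video.activity == audio.activity:
        return video.activity
    return video.activity if video.confidence >= audio.confidence else audio.activity

def build_event_log(case_id, segments):
    """Turn fused per-segment predictions into event-log rows (case, activity, timestamp)."""
    return [
        {"case": case_id, "activity": fuse(v, a), "timestamp": ts}
        for ts, v, a in segments
    ]

# Illustrative data: two synchronized segments from one manipulation run.
segments = [
    ("t0", Prediction("reach", 0.9), Prediction("reach", 0.7)),
    ("t1", Prediction("grasp", 0.6), Prediction("lift", 0.8)),  # audio is more confident
]
log = build_event_log("run-01", segments)
```

A log in this shape (case identifier, activity, timestamp) is exactly what standard process-mining tools expect as input, which is why the pipeline's output can feed discovery and conformance techniques directly.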
Year: 2026
ISBN: 9783032029355; 9783032029362
Keywords: Foundation Models, Activity Recognition, Process Mining, Robotic Systems
Files for this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12571/36125