AUTHOR=Venkataramanan Revathy, Padhee Swati, Rao Saini Rohan, Kaoshik Ronak, Sundara Rajan Anirudh, Sheth Amit
TITLE=Ki-Cook: clustering multimodal cooking representations through knowledge-infused learning
JOURNAL=Frontiers in Big Data
VOLUME=Volume 6 - 2023
YEAR=2023
URL=https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2023.1200840
DOI=10.3389/fdata.2023.1200840
ISSN=2624-909X
ABSTRACT=Cross-modal recipe retrieval has gained prominence due to its ability to retrieve a text representation given an image representation and vice versa. Clustering these recipe representations based on similarity is essential to retrieve relevant information about unknown food images. Existing works cluster similar recipe representations in the latent space based on class names. Due to inter-class similarity and intra-class variation, associating a recipe with a class name does not provide sufficient knowledge about recipes to determine similarity. On the other hand, the recipe title, ingredients, and cooking actions provide detailed knowledge about recipes and are a better determinant of similar recipes. In this work, we utilize this additional knowledge of recipes, such as ingredients and title, to identify similar recipes, with particular emphasis on rare ingredients. To incorporate this knowledge, we propose a knowledge-infused multimodal cooking representation learning network, Ki-Cook, built on the procedural attribute of the cooking process. To the best of our knowledge, this is the first work to adopt a comprehensive recipe similarity determinant to identify and cluster similar recipe representations. The proposed network also incorporates ingredient images to learn multimodal cooking representations. Since the motivation for clustering similar recipes is to retrieve relevant information for an unknown food image, we evaluate the model on the ingredient retrieval task.
Empirical analysis shows that our proposed model improves Coverage of Ground Truth by 12% and Intersection over Union by 10% compared to the baseline models. On average, the representations learned by our model contain 15.33% more rare ingredients than those of the baseline models. As a result, our qualitative evaluation shows a 39% improvement over the baseline in clustering similar recipes in the latent space, with an inter-annotator agreement of 0.35 (Fleiss' kappa).
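
The two ingredient retrieval metrics named above can be illustrated with a minimal sketch. The abstract does not define them, so the formulas here are assumptions: Intersection over Union is taken as the standard set ratio, and Coverage of Ground Truth is assumed to be the fraction of ground-truth ingredients that appear in the retrieved set. The function names and the example ingredient lists are hypothetical and are not taken from the paper.

```python
def intersection_over_union(predicted, ground_truth):
    """Standard set IoU: |P ∩ G| / |P ∪ G|."""
    pred, truth = set(predicted), set(ground_truth)
    union = pred | truth
    if not union:
        # Both sets are empty; return 0.0 by convention in this sketch.
        return 0.0
    return len(pred & truth) / len(union)


def coverage_of_ground_truth(predicted, ground_truth):
    """Assumed definition: |P ∩ G| / |G|, the share of true ingredients retrieved."""
    pred, truth = set(predicted), set(ground_truth)
    if not truth:
        # No ground-truth ingredients to cover; return 0.0 by convention.
        return 0.0
    return len(pred & truth) / len(truth)


if __name__ == "__main__":
    # Hypothetical ingredient lists, for illustration only.
    retrieved = ["flour", "sugar", "butter", "saffron"]
    reference = ["flour", "sugar", "egg", "saffron", "cardamom"]
    print("IoU:", intersection_over_union(retrieved, reference))
    print("Coverage:", coverage_of_ground_truth(retrieved, reference))
```

Under these assumed definitions, coverage rewards retrieving the true ingredients without penalizing extra ones, whereas IoU also penalizes ingredients retrieved in error, which is one reason the two might be reported together.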