
# Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories

Source: https://arxiv.org/html/2308.10562

Delfina Sol Martinez Pandiani [[email protected]](mailto:[email protected]) 0000-0003-2392-6300 (https://orcid.org/0000-0003-2392-6300)
University of Bologna
Department of Computer Science and Engineering (DISI)
Bologna, Italy

Centrum Wiskunde en Informatica
Human-Centered Data Analytics
Amsterdam, The Netherlands

Valentina Presutti
University of Bologna
Department of Modern Languages, Literatures and Cultures (LILEC)
Bologna, Italy

(2024)

###### Abstract

The field of Computer Vision (CV) is increasingly shifting towards "high-level" visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. This survey paper addresses this ambiguity by systematically reviewing research on high-level visual understanding, focusing particularly on Abstract Concepts (ACs) in automatic image classification. Our survey contributes in three main ways: Firstly, it clarifies the tacit understanding of high-level semantics in CV through a multidisciplinary analysis and categorization into distinct clusters, including commonsense, emotional, aesthetic, and inductive interpretative semantics. Secondly, it identifies and categorizes computer vision tasks associated with high-level visual sensemaking, offering insights into the diverse research areas within this domain. Lastly, it examines how abstract concepts such as values and ideologies are handled in CV, revealing challenges and opportunities in AC-based image classification. Notably, our survey of AC image classification tasks highlights persistent challenges, such as the limited efficacy of massive datasets and the importance of integrating supplementary information and mid-level features. We emphasize the growing relevance of hybrid AI systems in addressing the multifaceted nature of AC image classification tasks. Overall, this survey enhances our understanding of high-level visual reasoning in CV and lays the groundwork for future research endeavors.

**Keywords:** abstract concepts, image classification, social values, cultural notions, visual sensemaking

††copyright: acmlicensed
††journalyear: 2024
††doi: XXXXXXX.XXXXXXX
††journal: JACM
††journalvolume: XX
††journalnumber: X
††article: XXX
††publicationmonth: 2

††ccs: Computing methodologies Visual content-based indexing and retrieval
††ccs: Computing methodologies Computer vision problems
††ccs: Applied computing Arts and humanities

## 1. Introduction

Visual imagery has historically been a potent medium for conveying both abstract and concrete ideas, a significance evident in the vast amount of images shared daily on social media (Edwards, 2014). This surge in visual content has fueled extensive research in Computer Vision (CV), primarily aimed at automating the indexing, retrieval, and management of visual data, with applications spanning disciplines like sociology, media studies, and psychology (Joo et al., 2014; Arnold and Tilton, 2019). CV's data-driven approach, treating images as data, has been pivotal, facilitated further by the recent deep learning (DL) paradigm shift, leading to significant achievements in tasks such as image classification, object detection, and image generation (Bagi et al., 2020).

The remarkable success of the DL paradigm in CV has led to more intricate demands, including the need for tools capable of replicating human-like perception at a "high semantic level" (Hussain et al., 2017). This includes using CV to classify images based on high-level notions, known as Abstract Concepts (ACs), which have proven instrumental in tasks such as emotion classification (Cao et al., 2018; Mohammad and Kiritchenko, 2018a), political affiliation detection (Joo et al., 2014), beauty assessment (Gray et al., 2010), and personality trait inference (Segalin et al., 2017), all accomplished from raw visual data. However, explicit definitions of high-level visual semantics, particularly ACs, in machine vision are sparse. This lack of clarity, combined with the historical emphasis on physical object detection grounded in low-level feature analysis, often leads to weaker performance on high-level semantic tasks than on concrete object classes (Borghi and Binkofski, 2014). Additionally, these tasks are influenced by cultural contexts and human biases in perception, which redefine the depth of knowledge and understanding expected from CV models.

Our survey systematically reviews CV studies addressing the challenge of automatically classifying visual data based on high-level semantic units. We clarify what constitutes "abstract" or "high-level" semantics in the context of an image and identify CV tasks and automatic detection approaches related to these semantics. Focusing on abstract concept-based image classification (AC image classification), particularly in still images, we conduct a comprehensive overview of the state of the art. This includes:

1. **High-Level Semantic Units**: Identification and clustering of high-level semantic units, integrating insights from cognitive science, visual studies, art history, and computer science.
2. **High-Level CV Tasks**: A survey of the CV landscape to identify and cluster tasks associated with high-level visual sensemaking, examining common methodologies and datasets.
3. **AC Image Classification**: A detailed review of works dealing explicitly with AC image classification in still images.

This work is structured as follows. Section 2 provides an interdisciplinary examination and characterization of what constitutes "full" or "high-level" semantics in human visual understanding. Section 3 describes the methodology employed to identify works related to high-level semantics in the CV field. Section 4 surveys and categorizes CV tasks and works associated with high-level visual understanding, facilitating the discovery of implicit CV research addressing ACs. Section 5 presents a thorough survey of CV-based works that address tasks analogous to AC image classification. Section 6 presents datasets potentially relevant to the AC image classification task. The implications and contributions of the survey are discussed in Section 7. Finally, Section 8 provides concluding remarks. More details are available and documented in a dedicated GitHub repository.¹

¹ https://github.com/delfimpandiani/seeing_the_intangible. Access date: February 2024.

## 2. Defining High-Level Visual Semantics

### 2.1. Three-Tiered Semantics

**Figure 1. The three tiers of the visual semantics hierarchy.** Visual understanding is often depicted as a multi-layered process, revealing three distinct levels of semantics. The low-level involves raw or elemental features, while the mid-level encompasses individual objects, persons, and regions. In contrast, the high-level remains less defined and explored.

The concept that the perception and interpretation of visual meaning involve a multi-layered process is a shared perspective across various domains and applications, including cognitive science, CV, content-based image retrieval (CBIR), and visual studies. This multi-layered nature was emphasized in the seminal paper by Hare et al. (2006), which discussed Smeulders' idea of the "semantic gap" in CV (Smeulders et al., 2000). This paper also highlighted the common practice of referring to different strata of meaning within images, a concept that has been pivotal in CBIR. We delved deeper into several of these multi-layered approaches, drawing insights from works by Panofsky (1955), Shatford (1986), Greisdorf and O'Connor (2002), Eakins (2000), Jörgensen (2003), Hare et al. (2006), and Aditya et al. (2019).

This exploration revealed a general analogy wherein three semantic tiers are used to delineate the human visual understanding process: a "low-level," a "mid-level," and an "upper-" or "high-level" tier, corresponding to increasing complexity, variability, and subjectivity (see Figure 1). Most of these approaches represented these layers using a pyramid analogy to illustrate a hierarchical structure. Via a thorough analysis of the semantic elements assigned to each of the layers by each of the foundational works, we noted that there was a consensus in identifying and agreeing upon semantic units within both the low- and mid-level layers. However, this consensus did not extend to the topmost layer.

At the base, the "low-level" layer (depicted in light blue in Figure 1) encompasses raw or primitive features such as regions, edges, textures, colors, and shapes. Moving up to the "mid-level" layer (depicted in light purple in Figure 1), this tier accommodates semantic entities like objects, persons, regions, and places. Much of CV research has centered on this layer, emphasizing object recognition and image segmentation. In contrast, the "high-level" layer of semantics (depicted in dark purple in Figure 1) remains less detailed and subject to less consensus. This topmost tier, often associated with the concept of "full semantics," lacks an explicit and consistent definition and characterization of what types of semantic units belong in it. Instead, there appears to be a tacit shared understanding of the kinds of content that may reside or be conceived within this layer. In our analysis, this layer emerged as both elusive and significant, akin to the "tip of an iceberg" regarding visual semantics, motivating our efforts to define it more precisely.
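As a toy illustration of how mid-level semantic units might feed the high-level tier, the sketch below hard-codes hypothetical associations between detected objects and abstract concepts and scores ACs by counting associated detections. The object labels, concept names, and scoring rule are all assumptions chosen for illustration, not a method from the surveyed literature; real AC classifiers would learn such associations from data rather than enumerate them by hand.

```python
# Toy sketch: mapping mid-level detections (objects) to high-level
# abstract concepts via hand-written, purely illustrative associations.
from collections import Counter

# Hypothetical object-to-AC associations (illustrative only).
AC_ASSOCIATIONS = {
    "dove": ["peace"],
    "flag": ["patriotism", "power"],
    "crowd": ["solidarity", "power"],
    "candle": ["hope", "peace"],
}

def classify_abstract_concept(detected_objects):
    """Score abstract concepts by counting associated mid-level detections
    and return the top-scoring concept (None if no detection matches)."""
    scores = Counter()
    for obj in detected_objects:
        for concept in AC_ASSOCIATIONS.get(obj, []):
            scores[concept] += 1
    return scores.most_common(1)[0][0] if scores else None

print(classify_abstract_concept(["dove", "candle", "crowd"]))  # peace
```

The gap this toy glosses over is precisely the one the surveyed literature grapples with: which mid-level units evoke which abstract concepts is context-dependent, culturally situated, and subjective, which is why hand-coded mappings like the one above do not scale.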

### 2.2. Tip of the Iceberg: Upper Visual Semantics

Images may be sought "on the basis of their holistic content or message, as opposed to the information embedded within them by dint of their depiction of certain features" (Enser and Enser, 1999, p. 39). Most work that attempts to name and characterize where and how such holistic content arises thus moves in a layered way further away from raw or primitive features, to arrive at the "highest" tier of the semantic pyramid, referred to with different names: iconological layer (Panofsky and Drechsel, 1955), higher level of understanding (Jörgensen, 2003), abstract content (Shatford, 1986), abstract attributes (Eakins, 2000), subjective beliefs (Greisdorf and O'Connor, 2002), higher level semantics (Aditya et al., 2019), or full semantics (Hare et al., 2006).

**Figure 2. Tip of the iceberg: a deeper characterization of the top level of the visual semantic pyramid.** Drawing from a multidisciplinary exploration of semantic entities associated with this upper semantic layer, we have identified four distinct clusters of knowledge.

Part of the difficulty of solidifying a cross-disciplinary shared understanding of high-level semantics is that, in comparison to the other levels, high-level understanding by humans is increasingly cognitively complex. Complex cognitive processes, including abstraction, metonymic conveyance, adumbration, impression, prototypical displacement (Greisdorf and O'Connor, 2002), connotation (Hare et al., 2006; Shatford, 1986), evocation, and synthetic intuition (Panofsky and Drechsel, 1955) are considered crucial tools for understanding visual semantics at this "high level" of abstraction. However, such processes are generally considered impractical to capture with typical automatic image understanding and indexing methods. As such, this highest level of abstraction in the interpretation of image meaning or content is seen as a "seemingly insurmountable obstacle" to the application of content-based image retrieval techniques (Enser and Enser, 1999).

In addition to cognitive complexity, subjectivity represents another challenging aspect when it comes to characterizing and automatically recognizing semantic units within this level. Shatford's widely cited insight encapsulates this notion succinctly: "...the delight and frustration of pictorial resources is that a picture can mean different things to different people" (Shatford, 1986, p. 42). Furthermore, a single picture can convey diverse meanings not only to various people but also to the same individual in different contexts or at different times. In line with this perspective, Greisdorf underscores the importance of interdisciplinary perspectives as a foundational approach for modeling the attributes of the human image cataloging process, because:

> Those attributes tend to elude the indexing/cataloging process by exceeding the image indexing threshold due to individual viewer cognitive displacement of objects and object features that give rise to disjunctive prototypes that viewers may associate with the objects included as part of the image composition. These adumbrative, impressionistic and abstractionist concepts that relate viewer to image need to be captured with some type of retrieval mechanism in order to enhance retrieval effectiveness for system users. (Greisdorf and O'Connor, 2002, p. 11)

To better comprehend and communicate about these abstract semantics, there is a need to precisely identify the semantic units that may belong to this layer and potentially characterize their interrelationships. Thus, we systematically reviewed the cited literature to provide a more detailed characterization of this apex of visual semantics (see Figure 2). We categorized the types of elements mentioned as belonging to high-level visual semantics into four distinct clusters of knowledge.
