Researchers have uncovered new insights into how modern language models (LMs) organize and process information across different languages and modalities.
In the study, the team proposed the "semantic hub" hypothesis, which suggests that LMs create a shared representation space in which semantically similar inputs—whether in text, code, or even audio—naturally cluster together. Inspired by neuroscience, this concept helps explain how LMs achieve cross-modal understanding.
By analyzing representations spanning languages, arithmetic, code, and visual/audio data, the study demonstrates that tweaking this shared space in one modality can produce predictable effects in others.
Background
Modern LMs handle diverse data types, including multiple languages, code, math, images, and audio. Previous research suggested that LMs might use a common representation space—often centered around English—when processing different languages. However, it was unclear whether this principle extended across modalities.
This study builds on that idea, introducing the "semantic hub" hypothesis, which parallels the human brain’s transmodal hub. The researchers found that LMs naturally organize semantically similar inputs—regardless of language or modality—close together within intermediate layers, often structured around a dominant language like English.
Unlike earlier studies requiring explicit transformations to align modalities, this work demonstrated that LMs inherently develop a shared processing space. Intervention experiments further confirmed that modifying this space in the dominant language predictably influenced outputs across different data types.
The Semantic Hub Hypothesis
The semantic hub hypothesis suggests that LMs encode various data types (such as text and images) into a shared, modality-agnostic representation space. Rather than treating each modality separately, LMs map semantically similar inputs—such as translated sentences—closely together, allowing for generalization across modalities.
If an LM is trained primarily on one type of data, its representation space becomes "anchored" to that dominant type. As a result, when the model is prompted with input of another type, its intermediate layers tend to favor tokens from the dominant type rather than the input's own form.
To test this, researchers analyzed encoded input similarities and expected outputs using methods like cosine similarity and the logit lens, which track internal model states. If the hypothesis holds, LMs should favor predictions in their dominant language even when prompted with a different data type.
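The logit-lens idea can be sketched in a few lines of code. The snippet below is not taken from the paper: it assumes a Llama-style Hugging Face checkpoint (the model name, prompt, and attribute paths such as model.model.norm and model.lm_head are placeholders) and simply projects each intermediate hidden state of the last token through the final norm and unembedding matrix to see which token the model currently favors.

```python
# Minimal logit-lens sketch; checkpoint, prompt, and attribute paths are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

prompt = "这只猫坐在垫子上"  # a Chinese input sentence ("the cat sat on the mat")
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# "Logit lens": project every intermediate hidden state of the final position
# through the final norm and unembedding matrix and read off the top token.
for layer, h in enumerate(out.hidden_states):
    last = h[0, -1]
    logits = model.lm_head(model.model.norm(last))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax().item())!r}")
```

Under the hypothesis, the tokens read off from the middle layers of an English-dominant model would often be English, even for the Chinese prompt above.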
Evidence for the Semantic Hub
The study examined whether LMs develop a shared representation space for multilingual text, arithmetic, and structured data. Experiments confirmed that LM hidden states align semantically across languages and tasks, forming a central "semantic hub."
Analyzing multilingual models like Llama-2, Llama-3, Baichuan-2, and BLOOM, researchers observed that translations exhibit highly similar intermediate representations, particularly in the middle layers. English-dominant models structure representations around English tokens, even when processing Chinese text, while Chinese-dominant models do the reverse. BLOOM, a more balanced multilingual model, exhibited a shared space without a strong language preference.
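The layer-wise similarity measurement behind this observation can be sketched as follows; the checkpoint and the English-Chinese sentence pair are placeholders, and mean pooling over token positions is one simple way to obtain a single vector per layer. Under the semantic hub hypothesis, the similarity should peak in the middle layers.

```python
# Sketch: layer-wise cosine similarity between a sentence and its translation
# (placeholder checkpoint and sentence pair; mean-pooled hidden states per layer).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def layer_states(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # One mean-pooled vector per layer (embedding output included at index 0).
    return [h[0].mean(dim=0) for h in out.hidden_states]

en = layer_states("The cat sat on the mat.")
zh = layer_states("猫坐在垫子上。")  # Chinese translation of the same sentence

for layer, (a, b) in enumerate(zip(en, zh)):
    sim = F.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity = {sim:.3f}")
```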
A similar pattern emerged in arithmetic processing. LMs represented numerical expressions (e.g., “5+3”) similarly to their English word equivalents (“five plus three”). Despite differences in surface forms, models maintained internal consistency in representing identical values. Probing LM layers revealed that intermediate states favored English numerical words before converging on final numeric outputs.
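The same kind of comparison can be applied to arithmetic surface forms. The short sketch below, again with a placeholder checkpoint, an illustrative layer index, and made-up expressions, checks whether "5+3" sits closer to its English verbalization than to an unrelated expression at a middle layer.

```python
# Sketch: compare surface forms of the same value at one middle layer
# (placeholder checkpoint; layer index and expressions are illustrative).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

LAYER = 16  # illustrative middle layer

def mid_state(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)  # mean-pooled mid-layer state

expr = mid_state("5+3")
words = mid_state("five plus three")
control = mid_state("nine minus four")

print("'5+3' vs 'five plus three':", F.cosine_similarity(expr, words, dim=0).item())
print("'5+3' vs 'nine minus four':", F.cosine_similarity(expr, control, dim=0).item())
```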
These findings suggest that LMs organize meaning in a unified representation space, with middle layers acting as a processing hub. This organization may explain their multilingual and cross-modal capabilities, influencing tasks like translation, arithmetic reasoning, and structured data understanding.
Intervening in the Semantic Hub
To explore the causal role of the semantic hub, the researchers intervened in hidden representations using the activation addition (ActAdd) method. They manipulated model outputs cross-linguistically and across modalities.
In multilingual settings, they applied English-based transformations to non-English inputs (Spanish and Chinese). Results showed that sentiment modifications remained effective across languages with minimal loss of fluency or relevance.
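A minimal sketch of this kind of activation-addition steering is shown below. It is not the paper's exact procedure: the checkpoint, contrast pair, layer index, and scale are illustrative. A steering vector is computed from two contrasting English prompts and added to one layer's output while the model continues a Spanish prompt.

```python
# Hedged sketch of activation-addition (ActAdd)-style steering; checkpoint,
# contrast pair, layer index, and scale are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

LAYER, SCALE = 14, 4.0  # illustrative intervention layer and strength

def last_state(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]  # output of decoder layer LAYER

# Steering vector from a contrasting pair in the dominant language (English).
steer = last_state("Love") - last_state("Hate")

def add_steering(module, inputs, output):
    # Llama decoder layers return a tuple whose first element is the hidden state.
    hidden = output[0] + SCALE * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    prompt = "Mi opinión sobre la película es que"  # Spanish prompt
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```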
Arithmetic reasoning was also affected; modifying hidden states changed numerical predictions while preserving internal logic. Similarly, interventions in code led to systematic changes. In visual experiments, replacing hidden states of color image patches with language token embeddings misled the model into perceiving a different color. In audio tests, replacing mammal sound embeddings with non-mammal token embeddings biased model predictions toward non-mammal classifications.
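Reproducing the visual and audio swaps requires a specific multimodal checkpoint, but the underlying mechanism, overwriting the hidden states at chosen positions with the representation of a language token, can be sketched generically. In the snippet below the checkpoint, layer index, positions, and replacement word ("blue") are all hypothetical stand-ins for the patch or frame positions a real experiment would target.

```python
# Generic sketch of a hidden-state replacement intervention; layer, positions,
# prompt, and replacement word are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

LAYER = 14          # hypothetical intervention layer
POSITIONS = [3, 4]  # hypothetical positions to overwrite (stand-ins for patch slots)

# Mid-layer representation of the replacement concept, taken from the word "blue".
with torch.no_grad():
    ids = tok("blue", return_tensors="pt")
    blue_vec = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1][0, -1]

def overwrite(module, inputs, output):
    hidden = output[0]
    # Intervene only on the full-prompt pass; later decode steps see one token at a time.
    if hidden.shape[1] > max(POSITIONS):
        hidden = hidden.clone()
        hidden[:, POSITIONS, :] = blue_vec.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(overwrite)
try:
    prompt = "The apple in the photo is the color"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=3, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```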
These results indicate that LMs possess a shared representation space that enables cross-lingual and cross-modal interventions. Modifying key hidden states causally influenced model outputs while maintaining coherence.
Conclusion
This research provides strong evidence for the semantic hub hypothesis, showing that LMs develop a shared representation space across languages and modalities. By analyzing multilingual text, arithmetic, code, and visual/audio data, researchers demonstrated that semantically similar inputs cluster in intermediate layers, often anchored by a dominant language. Intervention experiments confirmed that modifying hidden states in one modality predictably influenced others.
Understanding this shared space could improve model interpretability, robustness, and cross-modal applications, guiding future AI and machine learning advancements.
Journal Reference
Wu, Z., Yu, X. V., Yogatama, D., Lu, J., & Kim, Y. (2024). The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities. arXiv preprint. DOI: 10.48550/arXiv.2411.04986, https://arxiv.org/pdf/2411.04986
Source:
Massachusetts Institute of Technology