Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report
arXiv:2511.16417v3 Announce Type: replace
Abstract: Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic reading order from slide-like irregular layouts and implicit hierarchies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a unified framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multi-modal aggregation pipeline that contextually transforms visual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical demands of financial research. Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets, featuring unified structured representations of multi-modal content, enriched with fine-grained layout and semantic annotations to better support ESG integration in financial governance and decision-making.