From Inconsistent Descriptions to Verified Retrieval Through a Local Data Cleaning Framework for Higher Education Climate Action Keyword Mapping

by Jagath Gunatilake, PMPC Gunathilake, Tilak Hewawasam

Published: June 13, 2026 • DOI: 10.47772/IJRISS.2026.1026EDU0324

Abstract

Higher Education Institutions (HEIs) are increasingly expected to report sustainability and climate-related activities that align with Nationally Determined Contributions (NDCs), are aggregated through Locally Determined Contributions (LDCs), and are recognised within national climate governance systems. This alignment is difficult when university activity descriptions use incomplete, inconsistent, or non-standard climate keywords: automated detection of climate actions fails, NDC mapping becomes incorrect, and inputs for future carbon quantification become unreliable. NDC documents compound the challenge because they contain structurally similar actions across different sectors as well as contradictory actions arising from sectoral trade-offs, issues that often stem from limited cross-sectoral coordination during preparation. This study develops a data cleaning framework for higher education climate action keyword mapping, using Sri Lanka as a pilot case. Sri Lanka’s updated NDCs serve as the national reference framework, so the methodology is tested in a context where climate reporting infrastructure and cross-sectoral coordination are still emerging. The methodology first applies Natural Language Processing (NLP) techniques, Jaccard similarity, and cosine similarity to identify sectoral overlaps and action-level conflicts in the NDCs. The framework then cleans climate-policy and sustainability-reporting text by removing duplicates, normalising terminology, and preserving metadata, and constructs a locally relevant keyword set for mitigation, adaptation, and cross-cutting climate actions, derived from sources aligned with IPCC, UNFCCC, NDC, SDG, and university sustainability reporting standards. Pilot results from Sri Lanka confirm the framework’s effectiveness.
The NDC diagnostic stage identified high-similarity action pairs, including forestry–coastal ecosystem restoration (0.85), industry–waste circular economy (0.82), power–industry renewable energy (0.78), and water–agriculture efficiency (0.75). The validated keyword lexicon achieved a retrieval precision of 0.83 and a recall of 0.79, with mitigation keywords reaching a precision of 0.87 and adaptation keywords 0.76. The study concludes that the Sri Lanka pilot demonstrates the framework’s transferability to other nations and that the framework transforms noisy sustainability descriptions into verified, retrieval-ready evidence, strengthening HEI contributions to national climate reporting systems.
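The similarity and retrieval measures named in the abstract can be illustrated with a minimal sketch. The abstract does not state how text is tokenised or vectorised, so the word-level token sets (for Jaccard) and raw term-count vectors (for cosine) used here are assumptions, as are the function names. The two action descriptions and the document IDs are hypothetical and are not taken from Sri Lanka’s NDCs or from the study’s data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed scheme)."""
    return re.findall(r"[a-z]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity: shared unique tokens divided by all unique tokens."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between raw term-count vectors of the two texts."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a labelled relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical action descriptions from two sectors (illustrative only).
forestry_action = "restore degraded mangrove ecosystems"
coastal_action = "restore degraded coastal ecosystems"

pair_jaccard = jaccard(forestry_action, coastal_action)  # 3 shared / 5 unique = 0.6
pair_cosine = cosine(forestry_action, coastal_action)    # 3 / (2 * 2) = 0.75

# Hypothetical document IDs: what the keyword lexicon retrieved vs. what
# reviewers labelled as genuine climate actions.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})  # 0.75, 0.6
```

In a diagnostic stage of this kind, pairs of actions from different sectors whose score exceeds a chosen threshold would be flagged for manual review as possible overlaps. The threshold, and any weighting such as TF-IDF, are design choices the abstract does not specify.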