Pl@ntBERT: leveraging large language models to enhance vegetation classification through species composition analysis
Biodiversity is under pressure, as many disturbance events threaten natural areas. Habitat distribution mapping is therefore increasingly relevant for monitoring the status of these areas. It aims to quantify the mathematical relationships between predictors and occurrences of categorized locations. Advanced numerical technologies are thus needed more than ever, as they help summarize our knowledge of species assemblages. Herein, we present Pl@ntBERT, a framework that encodes vegetation patterns and improves their classification. This tool leverages computer science and linguistic processes based on transformers. In particular, the pipeline implements two artificial intelligence tasks: fill-mask and text classification. First, masked language modeling builds a statistical understanding of vascular plant compositions. Then, subsequent training assigns a label to sentences describing phytosociological relevés. Fine-tuning a pretrained foundation model on in-domain words yields a significant improvement and clearly outperforms previous state-of-the-art methods. The software pushes the accuracy score on a database containing millions of European surveys to 92.48%. Finally, our results show that flora is a strong marker of ecosystems and does not need to be coupled with environmental data to train neural networks. The proposed application has a vocabulary covering over ten thousand organisms. This approach offers a methodology for advancing our understanding of community ecology and conservation biology.
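To make the fill-mask step more concrete, the following minimal Python sketch shows how a relevé could be written as a "sentence" of species names and how one species could be hidden for masked language modeling. The text format is an assumption made for illustration only: the comma separator, the BERT-style `[MASK]` token, the helper names `releve_to_sentence` and `mask_species`, and the example species list are all hypothetical and are not taken from the Pl@ntBERT implementation.

```python
from typing import List, Tuple

# Assumption: a BERT-style mask token; the real tokenizer may use another one.
MASK_TOKEN = "[MASK]"


def releve_to_sentence(species: List[str]) -> str:
    """Join the species of one relevé into a comma-separated 'sentence'."""
    return ", ".join(name.strip() for name in species if name.strip())


def mask_species(species: List[str], index: int) -> Tuple[str, str]:
    """Hide the species at `index`; return the masked sentence and the hidden name."""
    if not 0 <= index < len(species):
        raise IndexError("index out of range")
    masked = list(species)      # copy so the input list is left unchanged
    target = masked[index]      # the name a fill-mask model would have to recover
    masked[index] = MASK_TOKEN
    return releve_to_sentence(masked), target


# Hypothetical relevé, used only as illustrative input.
releve = ["Fagus sylvatica", "Quercus robur", "Anemone nemorosa"]
sentence = releve_to_sentence(releve)
masked_sentence, target = mask_species(releve, 1)

print(sentence)
print(masked_sentence, "->", target)
```

In such a setup, the masked sentence would be the model input for the fill-mask task, while the unmasked sentence paired with a habitat label would be the input for the text classification task.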