AUTHOR=Atagong Salvador D. , Tonnang Henri , Senagi Kennedy , Wamalwa Mark , Agboka Komi M. , Odindi John TITLE=A review on knowledge and information extraction from PDF documents and storage approaches JOURNAL=Frontiers in Artificial Intelligence VOLUME=Volume 8 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1466092 DOI=10.3389/frai.2025.1466092 ISSN=2624-8212 ABSTRACT=IntroductionAutomating the extraction of information from Portable Document Format (PDF) documents represents a major advancement in information extraction, with applications in various domains such as healthcare, law, or biochemistry. However, existing solutions face challenges related to accuracy, domain adaptability, and implementation complexity.MethodsA systematic review of the literature was conducted using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology to examine approaches and trends in PDF information extraction and storage approaches.ResultsThe review revealed three dominant methodological categories: rule-based systems, statistical learning models, and neural network-based approaches. Key limitations include the rigidity of rule-based methods, the lack of annotated domain-specific datasets for learning-based approaches, and issues such as hallucinations in large language models.DiscussionTo overcome these limitations, a conceptual framework is proposed comprising nine core components: project manager, document manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework aims to improve the accuracy, adaptability, and usability of PDF information extraction systems.