SYNTHETIC DOCUMENT GENERATION FOR THE TASK OF VISUAL DOCUMENT UNDERSTANDING
DOI:
https://doi.org/10.46991/PYSUA.2024.58.3.079
Keywords:
machine learning, data generation, document understanding
Abstract
Document analysis with machine learning methods requires large amounts of labeled data. Such data is not always available, and when it is, it covers only certain types of documents. In this paper, we present a method for generating synthetic data that can produce documents of any type from predefined document components. By varying the arrangement of document components, the text content, and the visual elements through configurations, we create diverse and realistic datasets that mimic real documents. This method addresses the shortage of labeled datasets and offers a flexible way to improve the results of a machine learning model.
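As a rough illustration of configuration-driven generation, the sketch below lays out predefined components on a page and returns labeled bounding boxes. It is a minimal sketch under stated assumptions, not the paper's implementation: the configuration schema, the component names, the sample texts, and the `generate_document` function are all hypothetical, and visual rendering is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical configuration: the keys, component labels, heights, and
# sample texts are illustrative assumptions, not the paper's schema.
CONFIG = {
    "page": {"width": 600, "height": 800, "margin": 20},
    "shuffle_order": True,
    "components": [
        {"label": "header", "height": 60, "texts": ["INVOICE", "RECEIPT"]},
        {"label": "address", "height": 80, "texts": ["12 Main St", "5 Oak Ave"]},
        {"label": "total", "height": 40, "texts": ["Total: 10.00", "Total: 99.50"]},
    ],
}


@dataclass
class Box:
    """One labeled component: its class, its text, and its bounding box."""
    label: str
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def generate_document(config, seed):
    """Return labeled boxes for one synthetic page (hypothetical helper)."""
    rng = random.Random(seed)  # seeded, so each document is reproducible
    page = config["page"]
    components = list(config["components"])
    if config.get("shuffle_order"):
        rng.shuffle(components)  # vary the arrangement of components
    boxes = []
    y = page["margin"]
    for comp in components:
        text = rng.choice(comp["texts"])  # vary the text content
        boxes.append(
            Box(
                label=comp["label"],
                text=text,
                x0=page["margin"],
                y0=y,
                x1=page["width"] - page["margin"],
                y1=y + comp["height"],
            )
        )
        y += comp["height"] + page["margin"]  # stack components vertically
    return boxes


if __name__ == "__main__":
    for box in generate_document(CONFIG, seed=0):
        print(box)
```

Because the labels and coordinates are produced together with the content, each generated page comes with its annotations, so no manual labeling is needed. Changing the seed or the configuration yields a different arrangement and different text.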
License
Copyright (c) 2025 Proceedings of the YSU

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.