SYNTHETIC DOCUMENT GENERATION FOR THE TASK OF VISUAL DOCUMENT UNDERSTANDING

Authors

K. S. Khechoyan

DOI:

https://doi.org/10.46991/PYSUA.2024.58.3.079

Keywords:

machine learning, data generation, document understanding

Abstract

Document analysis with machine learning methods requires a large amount of labeled data. Such data is not always available, and when it is, it covers only certain types of documents. In this paper, we present a method for generating synthetic data that can produce documents of any type from pre-defined document components. By varying the arrangement of the components, the text content, and the visual elements through configurations, we create diverse and realistic datasets that mimic real documents. This method addresses the shortage of labeled datasets and offers a flexible way to improve the results of machine learning models.
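As a rough illustration of configuration-driven generation, the Python sketch below places pre-defined components on a page in a random order, fills each with randomly chosen text, and returns labeled bounding boxes. It is not the paper's implementation: the configuration keys, the `Box` class, the `generate_document` function, and the sample texts are all hypothetical, and rendering of visual elements is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    label: str   # component type, doubles as the annotation label
    text: str    # generated text content
    x: int
    y: int
    width: int
    height: int


# Hypothetical configuration: each entry pre-defines one document component.
CONFIG = {
    "page": {"width": 600, "height": 800, "margin": 40, "gap": 20},
    "components": [
        {"label": "header", "height": 60,
         "texts": ["INVOICE", "RECEIPT", "ORDER FORM"]},
        {"label": "address", "height": 90,
         "texts": ["12 Main St, Springfield", "7 Oak Ave, Riverton"]},
        {"label": "table", "height": 200,
         "texts": ["Item | Qty | Price"]},
        {"label": "total", "height": 40,
         "texts": ["Total: 120.00", "Total: 58.40"]},
    ],
}


def generate_document(config, seed):
    """Return labeled boxes for one synthetic page (assumed layout scheme)."""
    rng = random.Random(seed)            # seeded for reproducible samples
    page = config["page"]
    components = list(config["components"])
    rng.shuffle(components)              # vary the arrangement of components

    boxes = []
    y = page["margin"]
    for comp in components:
        if y + comp["height"] > page["height"] - page["margin"]:
            raise ValueError("components do not fit on the page")
        # Vary the width and horizontal position within the margins.
        width = rng.randint(page["width"] // 2,
                            page["width"] - 2 * page["margin"])
        x = rng.randint(page["margin"],
                        page["width"] - page["margin"] - width)
        text = rng.choice(comp["texts"])  # vary the text content
        boxes.append(Box(comp["label"], text, x, y, width, comp["height"]))
        y += comp["height"] + page["gap"]
    return boxes


if __name__ == "__main__":
    for box in generate_document(CONFIG, seed=0):
        print(box)
```

Because the labels and coordinates come from the same configuration that drives the layout, every generated page is annotated without manual effort. Changing the component list or the text pools yields a different document type.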

Published

2025-01-25

Section

Informatics

How to Cite

Khechoyan, K. S. (2025). SYNTHETIC DOCUMENT GENERATION FOR THE TASK OF VISUAL DOCUMENT UNDERSTANDING. Proceedings of the YSU A: Physical and Mathematical Sciences, 58(3 (265)), 79-87. https://doi.org/10.46991/PYSUA.2024.58.3.079