{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Document Analysis Technology\n", "\n", "This chapter introduces the theory behind document analysis technology: its background, how the algorithms are classified, and the main idea behind each class.\n", "\n", "After studying this chapter, you will understand:\n", "\n", "1. The classification and typical ideas of layout analysis\n", "2. The classification and typical ideas of table recognition\n", "3. The classification and typical ideas of information extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Layout analysis is mainly used for document retrieval, key information extraction, content classification, and similar applications. Its task is to classify the content regions of a document image. Content categories generally include plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layouts and formats, poor document image quality, and the lack of large-scale annotated datasets still make layout analysis a challenging task. Document analysis typically comprises the following modules:\n", "\n", "1. Layout analysis module: Divides each document page into different content areas. This module can be used not only to separate relevant areas from irrelevant ones, but also to classify the type of content in each area.\n", "2. Optical Character Recognition (OCR) module: Locates and recognizes all text present in the document.\n", "3. Table recognition module: Recognizes the table information in the document and converts it into an Excel file.\n", "4. 
Information extraction module: Uses the OCR results and image information to understand and identify the specific information expressed in the document, or the relationships between pieces of information.\n", "\n", "Since the OCR module was introduced in detail in the previous chapters, the following sections cover the other three modules: layout analysis, table recognition, and information extraction. For each module, the classic or commonly used methods and datasets are introduced." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Layout Analysis\n", "\n", "### 1.1 Background Introduction\n", "\n", "Layout analysis is mainly used for document retrieval, key information extraction, content classification, and similar applications. Its task is to classify the content regions of a document image. Content categories generally include plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layouts and formats, poor document image quality, and the lack of large-scale annotated datasets still make layout analysis a challenging task.\n", "The layout analysis task is visualized in the figure below:\n", "