Introduction to the document Module

The document module is a core component of the Cucaracha library, designed to facilitate the handling and processing of various document formats, including PDFs and images. This module provides a Document class that serves as the main interface for interacting with documents, offering functionalities to load, manipulate, and analyze document pages.

Key Features

  • Document Loading: Easily load documents from PDF files or image files.
  • Page Selection: Select specific pages from multi-page documents for processing.
  • Conversion to Numpy Arrays: Convert document pages to numpy arrays for further image processing and analysis.
  • Integration with Image Processing Algorithms: Seamlessly integrate with various image processing algorithms provided by the Cucaracha library and other popular libraries like OpenCV, SimpleITK, Scikit-Image, Seaborn, and Matplotlib.

Document

The general concept of a Document for the cucaracha library.

Source code in cucaracha/__init__.py
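# Imports inferred from the names used below (assumption: the excerpt
# starts after the module's own import lines).
import os

import cv2 as cv
import numpy as np
import pymupdf
from rich.progress import track


class Document:
    """The general concept of a `Document` for the `cucaracha` library."""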

    def __init__(self, doc_path: str = None, **kwargs):
        """Document class constructor.

        This is the basic model of a document in the cucaracha library.
        The basic data for processing and analysis is a list of NumPy
        arrays (one per page), loaded automatically from the input
        `doc_path`.

        The input data can also be loaded after the object is created.
        However, the metadata is collected only at instantiation: when no
        input path is provided, the default values are kept, and most of
        them are `None`.

        Note:
            The PyMuPDF and OpenCV libraries are used to load data into the
            cucaracha Document object. Both libraries have extensive
            documentation describing the supported file formats. See more
            details at:

            - [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/index.html)

            - [OpenCV](https://opencv.org/)

        Args:
            doc_path (str, optional): Document path to be loaded. If None, a general object is created with `None` values in metadata information. Defaults to None.
        """
        self._doc_metadata = {
            'file_ext': None,
            'file_path': None,
            'file_name': None,
            'resolution': 96
            if kwargs.get('resolution') is None
            else int(kwargs.get('resolution')),
            'pages': None,
            'size': None,
        }

        self._doc_file = []
        if doc_path is not None:
            self._doc_file = self._read_by_ext(
                doc_path, dpi=self._doc_metadata['resolution']
            )

        self._collect_inner_metadata(doc_path)

    def load_document(self, path: str):
        """Load a document using a full path.

        If the Document object was instantiated with `None` as the path,
        this method can be called to load the document data into the object.

        The `Document()` constructor performs the same loading step
        internally. Note that this method does not refresh the metadata.

        Note:
            The PyMuPDF and OpenCV libraries are used to load data into the
            cucaracha Document object. Both libraries have extensive
            documentation describing the supported file formats. See more
            details at:

            - [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/index.html)

            - [OpenCV](https://opencv.org/)

        Args:
            path (str): Document full path to be loaded.
        """
        self._doc_file = self._read_by_ext(
            path, dpi=self._doc_metadata['resolution']
        )

    def save_document(self, file_name: str):
        """Save the current Document state to a file.

        The file format is chosen through the extension of `file_name`.

        The conversion is based on that extension. If `.pdf` is passed,
        PyMuPDF is used. If an image extension is passed, e.g. `.jpg` or
        `.png`, OpenCV is used.

        Note:
            This method saves the current state of the document object.
            Hence, after all image processing has been applied, the result
            can be saved with this method.

        Note:
            The file path can be seen by calling `get_metadata('file_path')`,
            which returns the directory of the original file given when the
            object was created.

        Args:
            file_name (str): File path where the document should be saved.
                If only a file name is passed (no directory), the original
                file directory from the constructor metadata is used.

        Raises:
            ValueError: Document metadata does not have a valid file path.
            TypeError: File name must indicate the file format (ex: .pdf, .jpg, .png, etc)
        """
        if self._doc_metadata.get('file_path') is None:
            raise ValueError(
                'Document metadata does not have a valid file path.'
            )

        filename, file_ext = os.path.splitext(file_name)
        if file_ext == '':
            raise TypeError(
                'File name must indicate the file format (ex: .pdf, .jpg, .png, etc)'
            )

        if os.sep in filename:
            if file_ext != '.pdf':
                # Save using OpenCV. Every page is written to the same
                # path, so for multi-page documents only the last page
                # remains on disk.
                for page in range(self._doc_metadata.get('pages')):
                    cv.imwrite(file_name, self._doc_file[page])
            else:
                # Save using PyMuPDF: each page goes through a temporary
                # PNG that is converted to a one-page PDF and appended.
                doc = pymupdf.open()  # new, empty PDF
                for page in range(self._doc_metadata.get('pages')):
                    tmp_img_path = filename + '_tmp_pg_' + str(page) + '.png'
                    cv.imwrite(tmp_img_path, self._doc_file[page])

                    # open image as a document
                    imgdoc = pymupdf.open(tmp_img_path)
                    # make a 1-page PDF of it
                    pdfbytes = imgdoc.convert_to_pdf()
                    imgdoc.close()
                    imgpdf = pymupdf.open('pdf', pdfbytes)
                    # insert the image PDF
                    doc.insert_pdf(imgpdf)

                    # Removing tmp file
                    os.remove(tmp_img_path)

                doc.save(file_name)
        else:
            if file_ext != '.pdf':
                # Save using opencv
                for page in range(self._doc_metadata.get('pages')):
                    cv.imwrite(
                        self._doc_metadata.get('file_path') + file_name,
                        self._doc_file[page],
                    )
            else:
                # Save using PyMuPDF: each page goes through a temporary
                # PNG that is converted to a one-page PDF and appended.
                doc = pymupdf.open()  # new, empty PDF
                for page in range(self._doc_metadata.get('pages')):
                    tmp_img_path = (
                        self._doc_metadata.get('file_path')
                        + filename
                        + '_tmp.png'
                    )
                    cv.imwrite(tmp_img_path, self._doc_file[page])

                    # open image as a document
                    imgdoc = pymupdf.open(tmp_img_path)
                    # make a 1-page PDF of it
                    pdfbytes = imgdoc.convert_to_pdf()
                    imgdoc.close()  # release the file before removing it
                    imgpdf = pymupdf.open('pdf', pdfbytes)
                    # insert the image PDF
                    doc.insert_pdf(imgpdf)

                    # Removing tmp file
                    os.remove(tmp_img_path)

                doc.save(self._doc_metadata.get('file_path') + file_name)

    def get_metadata(self, info: str = None):
        """Return the document metadata, which holds general information
        about how the data was loaded and its parameters.

        A specific entry can be requested by name. For instance, to see
        the `resolution` of the document:

        Examples:
            >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
            >>> doc.get_metadata('resolution')
            {'resolution': 96}
            >>> doc.get_metadata('file_ext')
            {'file_ext': '.pdf'}

        If the method is called without a specific key, all the
        metadata is returned.

        Examples:
            >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
            >>> meta = doc.get_metadata()
            >>> type(meta)
            <class 'dict'>
            >>> meta.keys()
            dict_keys(['file_ext', 'file_path', 'file_name', 'resolution', 'pages', 'size'])

        Args:
            info (str, optional): The metadata key to retrieve. Defaults to `None`, in which case all the metadata is returned.

        Raises:
            KeyError: Info is not provided in the Document class metadata

        Returns:
            dict: A single-entry dictionary `{info: value}` when `info` is given, otherwise the full metadata dictionary.
        """
        if info in self._doc_metadata.keys():
            return {info: self._doc_metadata.get(info)}
        elif info is None:
            return self._doc_metadata
        else:
            raise KeyError(
                'Info is not provided in the Document class metadata'
            )

    def get_page(self, page: int):
        """Return the page of the document selected by the `page`
        parameter.

        The `page` value must be within the range of pages that the
        document has. Otherwise, a `ValueError` is raised.

        Info:
            Page counting starts from zero (`0`).

        Examples:
            >>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
            >>> page = doc.get_page(0)
            >>> page.shape
            (103, 103, 3)


        Args:
            page (int): The zero-based number of the page to retrieve

        Raises:
            ValueError: page number is not present at the document

        Returns:
            np.ndarray: The selected page as a NumPy array
        """
        if page not in range(self._doc_metadata.get('pages')):
            raise ValueError('page number is not present at the document')

        return self._doc_file[page]

    def set_page(self, page: np.ndarray, index: int):
        """Replace a page of the document with a new one.

        The page index must be within the total range of pages
        in the document. See the metadata to get this information.

        Examples:
            >>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
            >>> doc.get_metadata('pages')
            {'pages': 1}

            The original information is loaded as usual
            >>> np.max(doc.get_page(0))
            255

            A page can be replaced like this:
            >>> new_page = np.ones(doc.get_page(0).shape)
            >>> doc.set_page(new_page, 0)

            Then the new page is placed in the document object
            >>> np.max(doc.get_page(0))
            1.0

        Args:
            page (np.ndarray): A NumPy array with the same shape as the page it replaces
            index (int): The zero-based index where the new page should be placed

        Raises:
            ValueError: Page index is out of range (total pages is ... and the index must be a non-negative integer)
            ValueError: New page is not a numpy array or has different shape from previous pages
        """
        if index >= len(self._doc_file) or index < 0:
            raise ValueError(
                f'Page index is out of range (total pages is {len(self._doc_file)} and the index must be a non-negative integer)'
            )

        if (
            not isinstance(page, np.ndarray)
            or page.shape != self.get_page(index).shape
        ):
            raise ValueError(
                'New page is not a numpy array or has different shape from previous pages'
            )

        self._doc_file[index] = page

    def run_pipeline(self, processors: list):
        """Apply a list of image processing methods to the pages held by
        the `Document` object.

        Processors are applied in the order given in the list.

        Examples:
            A processor can be defined as a wrapper function:
            >>> def proc2(input): return sparse_dots(input, 3)
            >>> def proc3(input): return inplane_deskew(input, 25)
            >>> proc_list = [otsu, proc2, proc3]

            Once `proc_list` is created, the pipeline can be run with:
            >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
            >>> doc.run_pipeline(proc_list) # doctest: +SKIP
            Applying processors... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

            The pages inside the `doc` object are then updated:
            >>> type(doc.get_page(0))
            <class 'numpy.ndarray'>

        Warning:
            Every processor in the list must follow the `cucaracha` filter
            convention: it accepts a NumPy array as input and returns a
            tuple with a NumPy array and a dictionary of extra parameters
            (`(np.ndarray, dict)`).

        Note:
            All the pages in the document object are processed. To apply a
            processor only to specific pages, process them individually
            and then update each page with the `set_page` method.

        Args:
            processors (list): Callables applied in order; each takes a page (`np.ndarray`) and returns `(np.ndarray, dict)`.
        """
        self._check_processor_list(processors)

        for proc in track(
            processors, description='[green]Applying processors...'
        ):
            for idx, page in enumerate(self._doc_file):
                self._doc_file[idx] = proc(page)[0]

    def _read_by_ext(self, path, dpi):
        _, file_ext = os.path.splitext(path)

        out_file = []
        if file_ext != '.pdf':
            out_file = [cv.imread(path)]
        else:
            out_file = self._read_pdf(path, dpi)

        return out_file

    def _read_pdf(self, path, dpi):
        doc = pymupdf.open(path)  # open document
        out_file = []
        for page in doc:  # iterate through the pages
            pix = page.get_pixmap(dpi=dpi)
            im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                pix.h, pix.w, pix.n
            )
            # PyMuPDF delivers RGB samples by default; reorder the
            # channels to BGR, the order OpenCV uses
            im = np.ascontiguousarray(im[..., [2, 1, 0]])
            out_file.append(im)

        return out_file

    def _collect_inner_metadata(self, doc_path):
        if doc_path is not None:
            # Set file_ext, file_path and file_name
            fullpath, file_ext = os.path.splitext(doc_path)
            self._doc_metadata['file_ext'] = file_ext

            lpath = fullpath.split(sep=os.sep)
            self._doc_metadata['file_path'] = os.sep.join(lpath[:-1]) + os.sep
            self._doc_metadata['file_name'] = lpath[-1]

            # Set file size
            self._doc_metadata['size'] = (
                os.path.getsize(doc_path) / 1024**2
            )   # file size in MiB (bytes / 1024**2)

            # Set file number of pages
            self._doc_metadata['pages'] = len(self._doc_file)

    def _check_processor_list(self, processors):
        if type(processors) != list:
            raise ValueError(
                'processors must be a list of valid cucaracha filter methods'
            )

        for proc in processors:
            out_test = proc(self.get_page(0))   # Test the processor output
            if (
                type(out_test) != tuple
                or not isinstance(out_test[0], np.ndarray)
                or not isinstance(out_test[1], dict)
            ):
                raise TypeError(
                    f'Processor: {proc.__name__} is not valid. Ensure that the processor output is valid.'
                )

__init__(doc_path=None, **kwargs)

Document class constructor

This is the basic model of a document in the cucaracha library. The basic data for processing and analysis is a list of NumPy arrays (one per page), loaded automatically from the input doc_path.

The input data can also be loaded after the object is created. However, the metadata is collected only at instantiation: when no input path is provided, the default values are kept, and most of them are None.

Note

The PyMuPDF and OpenCV libraries are used to load data into the cucaracha Document object. Both libraries have extensive documentation describing the supported file formats. See more details at:

  • PyMuPDF: https://pymupdf.readthedocs.io/en/latest/index.html
  • OpenCV: https://opencv.org/

Parameters:

Name | Type | Description | Default
doc_path | str | Document path to be loaded. If None, a general object is created with None values in the metadata. | None
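
The path-related metadata entries (file_ext, file_path, file_name) can be illustrated with a minimal sketch. It uses only os.path, and the helper name split_doc_path is hypothetical: it is not part of cucaracha, it only mimics how a path is broken into those three entries.

```python
import os


def split_doc_path(doc_path: str) -> dict:
    """Hypothetical helper mirroring the path-related metadata entries."""
    # Separate the extension from the rest of the path
    fullpath, file_ext = os.path.splitext(doc_path)
    # Directory part, kept with a trailing separator
    file_path = os.path.dirname(fullpath) + os.sep
    # File name without directory and without extension
    file_name = os.path.basename(fullpath)
    return {
        'file_ext': file_ext,
        'file_path': file_path,
        'file_name': file_name,
    }


example = os.path.join('tests', 'files', 'sample-text-en.pdf')
meta = split_doc_path(example)
print(meta)
```

Keeping the trailing separator on file_path matters because save_document builds its destination by plain string concatenation of file_path and the requested file name.
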
Source code in cucaracha/__init__.py
def __init__(self, doc_path: str = None, **kwargs):
    """Document class constructor.

    This is the basic model of a document in the cucaracha library.
    The basic data for processing and analysis is a list of NumPy
    arrays (one per page), loaded automatically from the input
    `doc_path`.

    The input data can also be loaded after the object is created.
    However, the metadata is collected only at instantiation: when no
    input path is provided, the default values are kept, and most of
    them are `None`.

    Note:
        The PyMuPDF and OpenCV libraries are used to load data into the
        cucaracha Document object. Both libraries have extensive
        documentation describing the supported file formats. See more
        details at:

        - [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/index.html)

        - [OpenCV](https://opencv.org/)

    Args:
        doc_path (str, optional): Document path to be loaded. If None, a general object is created with `None` values in metadata information. Defaults to None.
    """
    self._doc_metadata = {
        'file_ext': None,
        'file_path': None,
        'file_name': None,
        'resolution': 96
        if kwargs.get('resolution') is None
        else int(kwargs.get('resolution')),
        'pages': None,
        'size': None,
    }

    self._doc_file = []
    if doc_path is not None:
        self._doc_file = self._read_by_ext(
            doc_path, dpi=self._doc_metadata['resolution']
        )

    self._collect_inner_metadata(doc_path)

get_metadata(info=None)

Returns the document metadata, which holds general information about how the data was loaded and its parameters.

A specific entry can be requested by name. For instance, to see the resolution of the document:

Examples:

>>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
>>> doc.get_metadata('resolution')
{'resolution': 96}
>>> doc.get_metadata('file_ext')
{'file_ext': '.pdf'}

If the method is called without a specific key, all the metadata is returned.

Examples:

>>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
>>> meta = doc.get_metadata()
>>> type(meta)
<class 'dict'>
>>> meta.keys()
dict_keys(['file_ext', 'file_path', 'file_name', 'resolution', 'pages', 'size'])

Parameters:

Name | Type | Description | Default
info | str | The metadata key to retrieve. If None, all the metadata is returned. | None

Raises:

Type | Description
KeyError | Info is not provided in the Document class metadata

Returns:

Type | Description
dict | A single-entry dictionary {info: value} when info is given, otherwise the full metadata dictionary.
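
The lookup rule described above can be sketched on a plain dictionary. The function lookup_metadata and the sample values are hypothetical stand-ins, not the library's API; they only reproduce the three outcomes (single key, everything, unknown key).

```python
def lookup_metadata(metadata: dict, info: str = None) -> dict:
    """Hypothetical stand-in for the lookup rule described above."""
    if info is None:
        # No key requested: return everything
        return metadata
    if info in metadata:
        # Known key: wrap the value in a single-entry dictionary
        return {info: metadata[info]}
    raise KeyError('Info is not provided in the Document class metadata')


sample = {'file_ext': '.pdf', 'resolution': 96, 'pages': 1}
single = lookup_metadata(sample, 'resolution')
everything = lookup_metadata(sample)

try:
    lookup_metadata(sample, 'author')
    missing_raised = False
except KeyError:
    missing_raised = True
```

Note that the full-metadata call hands back the same dictionary object rather than a copy, so changes made to the returned value also change the stored metadata.
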

Source code in cucaracha/__init__.py
def get_metadata(self, info: str = None):
    """Return the document metadata, which holds general information
    about how the data was loaded and its parameters.

    A specific entry can be requested by name. For instance, to see
    the `resolution` of the document:

    Examples:
        >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
        >>> doc.get_metadata('resolution')
        {'resolution': 96}
        >>> doc.get_metadata('file_ext')
        {'file_ext': '.pdf'}

    If the method is called without a specific key, all the
    metadata is returned.

    Examples:
        >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
        >>> meta = doc.get_metadata()
        >>> type(meta)
        <class 'dict'>
        >>> meta.keys()
        dict_keys(['file_ext', 'file_path', 'file_name', 'resolution', 'pages', 'size'])

    Args:
        info (str, optional): The metadata key to retrieve. Defaults to `None`, in which case all the metadata is returned.

    Raises:
        KeyError: Info is not provided in the Document class metadata

    Returns:
        dict: A single-entry dictionary `{info: value}` when `info` is given, otherwise the full metadata dictionary.
    """
    if info in self._doc_metadata.keys():
        return {info: self._doc_metadata.get(info)}
    elif info is None:
        return self._doc_metadata
    else:
        raise KeyError(
            'Info is not provided in the Document class metadata'
        )

get_page(page)

Returns the page of the document selected by the page parameter.

The page value must be within the range of pages that the document has. Otherwise, a ValueError is raised.

Info

Page counting starts from zero (0).

Examples:

>>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
>>> page = doc.get_page(0)
>>> page.shape
(103, 103, 3)

Parameters:

Name | Type | Description | Default
page | int | The zero-based number of the page to retrieve | required

Raises:

Type | Description
ValueError | page number is not present at the document

Returns:

Type | Description
np.ndarray | The selected page as a NumPy array
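
The range check can be sketched without the library. In this minimal illustration, page_or_error is a hypothetical function and the strings are placeholders for page arrays; the point is that the valid indices run from 0 to pages - 1 and that negative indices are rejected as well.

```python
def page_or_error(pages: list, page: int):
    """Hypothetical stand-in for the range check described above."""
    if page not in range(len(pages)):
        raise ValueError('page number is not present at the document')
    return pages[page]


pages = ['first page', 'second page']  # placeholders for page arrays
first = page_or_error(pages, 0)
last = page_or_error(pages, len(pages) - 1)

rejected = []
for bad_index in (len(pages), -1):
    try:
        page_or_error(pages, bad_index)
    except ValueError:
        rejected.append(bad_index)
```

Unlike plain Python lists, this check does not accept -1 as "the last page", so the last page has to be addressed as pages - 1.
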

Source code in cucaracha/__init__.py
def get_page(self, page: int):
    """Return the page of the document selected by the `page`
    parameter.

    The `page` value must be within the range of pages that the
    document has. Otherwise, a `ValueError` is raised.

    Info:
        Page counting starts from zero (`0`).

    Examples:
        >>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
        >>> page = doc.get_page(0)
        >>> page.shape
        (103, 103, 3)


    Args:
        page (int): The zero-based number of the page to retrieve

    Raises:
        ValueError: page number is not present at the document

    Returns:
        np.ndarray: The selected page as a NumPy array
    """
    if page not in range(self._doc_metadata.get('pages')):
        raise ValueError('page number is not present at the document')

    return self._doc_file[page]

load_document(path)

Load a document using a full path.

If the Document object was instantiated with None as the path, this method can be called to load the document data into the object.

The Document() constructor performs the same loading step internally. Note that this method does not refresh the metadata.

Note

The PyMuPDF and OpenCV libraries are used to load data into the cucaracha Document object. Both libraries have extensive documentation describing the supported file formats. See more details at:

  • PyMuPDF: https://pymupdf.readthedocs.io/en/latest/index.html
  • OpenCV: https://opencv.org/

Parameters:

Name | Type | Description | Default
path | str | Document full path to be loaded. | required
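
Loading picks a backend from the file extension. The sketch below shows that choice in isolation; pick_reader is a hypothetical helper that only names the backend, and it assumes the case-sensitive '.pdf' comparison visible in the class source.

```python
import os


def pick_reader(path: str) -> str:
    """Hypothetical helper: name the backend chosen for a given path."""
    _, file_ext = os.path.splitext(path)
    # The comparison is case-sensitive, as in the class source
    return 'pymupdf' if file_ext == '.pdf' else 'opencv'


choices = {
    name: pick_reader(name)
    for name in ('scan.pdf', 'scan.png', 'scan.PDF')
}
print(choices)
```

Under this rule a file named scan.PDF is routed to the image reader, so normalizing the extension to lower case before loading may be worthwhile.
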
Source code in cucaracha/__init__.py
def load_document(self, path: str):
    """Load a document using a full path.

    If the Document object was instantiated with `None` as the path,
    this method can be called to load the document data into the object.

    The `Document()` constructor performs the same loading step
    internally. Note that this method does not refresh the metadata.

    Note:
        The PyMuPDF and OpenCV libraries are used to load data into the
        cucaracha Document object. Both libraries have extensive
        documentation describing the supported file formats. See more
        details at:

        - [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/index.html)

        - [OpenCV](https://opencv.org/)

    Args:
        path (str): Document full path to be loaded.
    """
    self._doc_file = self._read_by_ext(
        path, dpi=self._doc_metadata['resolution']
    )

run_pipeline(processors)

Apply a list of image processing methods to the pages held by the Document object.

Processors are applied in the order given in the list.

Examples:

A processor can be defined as a wrapper function:

>>> def proc2(input): return sparse_dots(input, 3)
>>> def proc3(input): return inplane_deskew(input, 25)
>>> proc_list = [otsu, proc2, proc3]

Once proc_list is created, the pipeline can be run with:

>>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
>>> doc.run_pipeline(proc_list)
Applying processors... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

The pages inside the doc object are then updated:

>>> type(doc.get_page(0))
<class 'numpy.ndarray'>

Warning

Every processor in the list must follow the cucaracha filter convention: it accepts a NumPy array as input and returns a tuple with a NumPy array and a dictionary of extra parameters ((np.ndarray, dict)).

Note

All the pages in the document object are processed. To apply a processor only to specific pages, process them individually and then update each page with the set_page method.

Parameters:

Name | Type | Description | Default
processors | list | Callables applied in order; each takes a page (np.ndarray) and returns (np.ndarray, dict). | required
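
The processor contract and the order of application can be sketched with toy processors. This is only an illustration under stated assumptions: nested lists stand in for NumPy arrays so the sketch runs without dependencies, and invert, threshold and apply_pipeline are hypothetical names, not cucaracha filters.

```python
def invert(page):
    """Toy processor: invert 8-bit pixel values, no extra parameters."""
    return [[255 - px for px in row] for row in page], {}


def threshold(page, level=128):
    """Toy processor: binarize the page and report the level used."""
    out = [[255 if px >= level else 0 for px in row] for row in page]
    return out, {'level': level}


def apply_pipeline(pages, processors):
    """Apply each processor, in order, to every page; keep only the image."""
    for proc in processors:
        for idx, page in enumerate(pages):
            # Element 0 is the processed page, element 1 the extras dict
            pages[idx] = proc(page)[0]
    return pages


pages = [[[10, 200], [130, 90]]]  # one tiny 2x2 "page"
result = apply_pipeline(pages, [invert, threshold])
print(result)
```

Only the first element of each processor's return value is stored, so any extra parameters reported in the dictionary are discarded by the pipeline.
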
Source code in cucaracha/__init__.py
def run_pipeline(self, processors: list):
    """Apply a list of image processing methods to the pages held by
    the `Document` object.

    Processors are applied in the order given in the list.

    Examples:
        A processor can be defined as a wrapper function:
        >>> def proc2(input): return sparse_dots(input, 3)
        >>> def proc3(input): return inplane_deskew(input, 25)
        >>> proc_list = [otsu, proc2, proc3]

        Once `proc_list` is created, the pipeline can be run with:
        >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
        >>> doc.run_pipeline(proc_list) # doctest: +SKIP
        Applying processors... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

        The pages inside the `doc` object are then updated:
        >>> type(doc.get_page(0))
        <class 'numpy.ndarray'>

    Warning:
        Every processor in the list must follow the `cucaracha` filter
        convention: it accepts a NumPy array as input and returns a
        tuple with a NumPy array and a dictionary of extra parameters
        (`(np.ndarray, dict)`).

    Note:
        All the pages in the document object are processed. To apply a
        processor only to specific pages, process them individually
        and then update each page with the `set_page` method.

    Args:
        processors (list): Callables applied in order; each takes a page (`np.ndarray`) and returns `(np.ndarray, dict)`.
    """
    self._check_processor_list(processors)

    for proc in track(
        processors, description='[green]Applying processors...'
    ):
        for idx, page in enumerate(self._doc_file):
            self._doc_file[idx] = proc(page)[0]

save_document(file_name)

Saves the current Document state to a file.

The file format is chosen through the extension of file_name.

The conversion is based on that extension. If .pdf is passed, PyMuPDF is used. If an image extension is passed, e.g. .jpg or .png, OpenCV is used.

Note

This method saves the current state of the document object. Hence, after all image processing has been applied, the result can be saved with this method.

Note

The file path can be seen by calling get_metadata('file_path'), which returns the directory of the original file given when the object was created.

Parameters:

Name | Type | Description | Default
file_name | str | File path where the document should be saved. If only a file name is passed (no directory), the original file directory from the constructor metadata is used. | required

Raises:

Type | Description
ValueError | Document metadata does not have a valid file path.
TypeError | File name must indicate the file format (ex: .pdf, .jpg, .png, etc)
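
How the destination is resolved can be sketched on its own. The helper resolve_destination below is hypothetical and writes nothing to disk; it only reproduces the two rules described above: a name without an extension is rejected, and a bare file name is placed in the original document's directory.

```python
import os


def resolve_destination(file_name: str, file_path: str) -> str:
    """Hypothetical helper: where a save request would end up."""
    stem, file_ext = os.path.splitext(file_name)
    if file_ext == '':
        raise TypeError(
            'File name must indicate the file format (ex: .pdf, .jpg, .png, etc)'
        )
    if os.sep in stem:
        # A directory was given: use the path as is
        return file_name
    # Bare file name: save next to the original document
    return file_path + file_name


original_dir = os.path.join('tests', 'files') + os.sep
bare = resolve_destination('result.png', original_dir)
explicit = resolve_destination(os.path.join('out', 'result.pdf'), original_dir)

try:
    resolve_destination('result', original_dir)
    no_ext_rejected = False
except TypeError:
    no_ext_rejected = True
```

Because the bare-name case relies on string concatenation, file_path is expected to end with a path separator, which matches how the metadata stores it.
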

Source code in cucaracha/__init__.py
def save_document(self, file_name: str):
    """Saves the Document state as a file.

    The file format is chosen through the extension given in the file name.

    The conversion is based on the file format. If the `.pdf` extension
    is passed, the file is written with PyMuPDF. If an image extension is
    passed, e.g. `.jpg` or `.png`, the file is written with OpenCV.

    Note:
        This method saves the current state of the document object. Hence,
        after all image processing has been applied, the resulting
        document can be saved with this method.

    Note:
        The file path can be retrieved by calling
        `get_metadata('file_path')`, which returns the original file path
        given when the object was created.

    Args:
        file_name (str): File path where the document should be saved on
            the hard drive. If only a file name is passed, the original
            file path from the constructor metadata is used.

    Raises:
        ValueError: Document metadata does not have a valid file path.
        TypeError: File name must indicate the file format (e.g. .pdf, .jpg, .png)
    """
    if self._doc_metadata.get('file_path') is None:
        raise ValueError(
            'Document metadata does not have a valid file path.'
        )

    filename, file_ext = os.path.splitext(file_name)
    if file_ext == '':
        raise TypeError(
            'File name must indicate the file format (e.g. .pdf, .jpg, .png)'
        )

    if os.sep in filename:
        if file_ext != '.pdf':
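            # Save using OpenCV. Every page is written to the same path,
            # so for a multi-page document only the last page remains.
            for page in range(self._doc_metadata.get('pages')):
                cv.imwrite(file_name, self._doc_file[page])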
        else:
            # Save using PyMuPDF: each page goes through a temporary PNG
            # that is converted to a one-page PDF and appended to `doc`
            doc = pymupdf.open()                           # new PDF
            for page in range(self._doc_metadata.get('pages')):
                tmp_img_path = filename + '_tmp_pg_' + str(page) + '.png'
                cv.imwrite(tmp_img_path, self._doc_file[page])

                # open image as a document
                imgdoc = pymupdf.open(tmp_img_path)
                # make a 1-page PDF of it
                pdfbytes = imgdoc.convert_to_pdf()
                imgdoc.close()
                imgpdf = pymupdf.open('pdf', pdfbytes)
                # insert the image PDF
                doc.insert_pdf(imgpdf)

                # Removing tmp file
                os.remove(tmp_img_path)

            doc.save(file_name)
    else:
        if file_ext != '.pdf':
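            # Save using OpenCV. Every page is written to the same path,
            # so for a multi-page document only the last page remains.
            for page in range(self._doc_metadata.get('pages')):
                cv.imwrite(
                    self._doc_metadata.get('file_path') + file_name,
                    self._doc_file[page],
                )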
        else:
            # Save using PyMuPDF: each page goes through a temporary PNG
            # that is converted to a one-page PDF and appended to `doc`

            doc = pymupdf.open()                           # new PDF
            for page in range(self._doc_metadata.get('pages')):
                tmp_img_path = (
                    self._doc_metadata.get('file_path')
                    + filename
                    + '_tmp.png'
                )
                cv.imwrite(tmp_img_path, self._doc_file[page])

                # open image as a document
                imgdoc = pymupdf.open(tmp_img_path)
                # make a 1-page PDF of it
                pdfbytes = imgdoc.convert_to_pdf()
                # release the temporary image before it is deleted
                imgdoc.close()
                imgpdf = pymupdf.open('pdf', pdfbytes)
                # insert the image PDF
                doc.insert_pdf(imgpdf)

                # Removing tmp file
                os.remove(tmp_img_path)

            doc.save(self._doc_metadata.get('file_path') + file_name)
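
The listing above builds output paths by string concatenation and writes every image page to the same file name. One possible alternative is sketched below, assuming the `file_path` metadata holds the path of the original file and that a bare file name should land in the same directory. Both helpers, `resolve_output_path` and `per_page_paths`, are hypothetical and not part of Cucaracha, and the `_pg_` suffix is borrowed from the temporary-file naming in the listing only for illustration.

```python
import os


def resolve_output_path(original_path: str, file_name: str) -> str:
    """Hypothetical helper: place a bare file name next to the original file."""
    if os.path.dirname(file_name):
        # The caller already supplied a directory, so use the path as given
        return file_name
    return os.path.join(os.path.dirname(original_path), file_name)


def per_page_paths(path: str, n_pages: int) -> list:
    """Hypothetical helper: one distinct output path per page."""
    if n_pages == 1:
        return [path]
    root, ext = os.path.splitext(path)
    return [f'{root}_pg_{i}{ext}' for i in range(n_pages)]


source = os.path.join('data', 'in', 'sample.pdf')
target = resolve_output_path(source, 'out.png')
print(per_page_paths(target, 3))
```

With distinct per-page names, an image export of a multi-page document would keep every page instead of only the last one.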

set_page(page, index)

Replace a page of the document with a new one.

The page index must lie within the range of pages in the document. See the pages metadata for the page count.

Examples:

>>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
>>> doc.get_metadata('pages')
{'pages': 1}

The original page content is loaded as usual:

>>> np.max(doc.get_page(0))
255

A page can then be replaced like this:

>>> new_page = np.ones(doc.get_page(0).shape)
>>> doc.set_page(new_page, 0)

The new page is now stored in the document object:

>>> np.max(doc.get_page(0))
1.0

Parameters:

Name Type Description Default
page ndarray

A numpy array with the same shape as the page it replaces

required
index int

The index where the new page should be placed

required

Raises:

Type Description
ValueError

Page index is out of range (total page is ... and must be a positive integer)

ValueError

New page is not a numpy array or has different shape from previous pages

Source code in cucaracha/__init__.py
def set_page(self, page: np.ndarray, index: int):
    """Replace a page of the document with a new one.

    The page index must lie within the range of pages in the document.
    See the `pages` metadata for the page count.

    Examples:
        >>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
        >>> doc.get_metadata('pages')
        {'pages': 1}

        The original page content is loaded as usual:
        >>> np.max(doc.get_page(0))
        255

        A page can then be replaced like this:
        >>> new_page = np.ones(doc.get_page(0).shape)
        >>> doc.set_page(new_page, 0)

        The new page is now stored in the document object:
        >>> np.max(doc.get_page(0))
        1.0

    Args:
        page (np.ndarray): A numpy array with the same shape as the page it replaces
        index (int): The index where the new page should be placed

    Raises:
        ValueError: Page index is out of range (total page is ... and must be a positive integer)
        ValueError: New page is not a numpy array or has different shape from previous pages
    """
    if index >= len(self._doc_file) or index < 0:
        raise ValueError(
            f'Page index is out of range (total page is {len(self._doc_file)} and must be a positive integer)'
        )

    if (
        not isinstance(page, np.ndarray)
        or page.shape != self.get_page(index).shape
    ):
        raise ValueError(
            'New page is not a numpy array or has different shape from previous pages'
        )

    self._doc_file[index] = page
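
The two validation steps of `set_page` can be tried in isolation with the sketch below, which works on a plain list of numpy arrays rather than a Document object. The function `replace_page` is a hypothetical stand-alone version written for this example; it uses `>=` for the upper bound so that an index equal to the page count is rejected.

```python
import numpy as np


def replace_page(pages: list, page: np.ndarray, index: int) -> None:
    """Hypothetical stand-alone version of the set_page checks."""
    if index < 0 or index >= len(pages):
        raise ValueError(
            f'Page index is out of range (total page is {len(pages)})'
        )
    if not isinstance(page, np.ndarray) or page.shape != pages[index].shape:
        raise ValueError(
            'New page is not a numpy array or has a different shape'
        )
    pages[index] = page


# A one-page stand-in document
pages = [np.zeros((2, 3), dtype=np.uint8)]
replace_page(pages, np.ones((2, 3)), 0)
print(pages[0].max())
```

Only the shape is compared, not the dtype, so a float array can replace an 8-bit page, as in the docstring example where the maximum changes from 255 to 1.0.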