CPF/CNPJ Extractor Node

The CPF Extractor node identifies, validates, and extracts Brazilian CPF (Cadastro de Pessoas Físicas, the individual taxpayer registry) and CNPJ (Cadastro Nacional da Pessoa Jurídica, the national corporate taxpayer registry) numbers from text that has already been extracted from documents.

Unlike a plain text search, this node validates each candidate number against its check digits and applies conflict detection logic.
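
As an illustration of what check-digit validation involves, the sketch below implements the standard public modulo-11 algorithm for CPF and CNPJ numbers. It is not the node's internal code, which this page does not document; the function names is_valid_cpf and is_valid_cnpj are hypothetical.

```python
import string


def _digits(value):
    """Keep only the ASCII digits of a formatted or unformatted number."""
    return [int(c) for c in value if c in string.digits]


def _check_digit(digits, weights):
    """Return the modulo-11 check digit for `digits` under `weights`."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cpf(value):
    """Standard CPF rule: 9 base digits followed by 2 check digits."""
    digits = _digits(value)
    # Sequences such as 111.111.111-11 satisfy the arithmetic but are
    # conventionally rejected.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    first = _check_digit(digits[:9], range(10, 1, -1))    # weights 10..2
    second = _check_digit(digits[:10], range(11, 1, -1))  # weights 11..2
    return digits[9] == first and digits[10] == second


def is_valid_cnpj(value):
    """Standard CNPJ rule: 12 base digits followed by 2 check digits."""
    digits = _digits(value)
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    weights_first = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
    weights_second = [6] + weights_first
    first = _check_digit(digits[:12], weights_first)
    second = _check_digit(digits[:13], weights_second)
    return digits[12] == first and digits[13] == second
```

Under this algorithm, 123.456.789-09 and 12.345.678/0001-95 are valid. The sample numbers used elsewhere on this page fail the check, so they are best read as format placeholders.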

[Figure: CPF Extractor node on the canvas]

Prerequisites

This node does not read files directly. It expects text that has already been extracted from a document, so the standard flow is:

  1. Input (Receives the file)
  2. File Extractor (Converts PDF/Image to text)
  3. CPF Extractor (Reads the text and searches for CPF and CNPJ numbers)

Configuration

Configuration requires only connecting the node to its data source.

[Figure: CPF Extractor configuration panel]

Input

You must select the variable that contains the text content of the files.

  • Field: "Input".
  • What to select: Look for the output of the previous extraction node, usually {{ extrator_de_arquivos.contents }}.

What Does It Detect?

The node identifies both formatted and unformatted numbers:

  • CPF: 123.456.789-00 or 12345678900
  • CNPJ: 12.345.678/0001-00 or 12345678000100
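
The two formats above can be matched with regular expressions. The following is a minimal sketch, not the node's actual patterns: CPF_PATTERN, CNPJ_PATTERN, and find_candidates are hypothetical names, and the digit lookarounds are one way to keep an 11-digit CPF from matching inside a longer run of digits.

```python
import re

# Formatted (123.456.789-00) or unformatted (12345678900) CPF,
# not embedded in a longer digit run.
CPF_PATTERN = re.compile(r"(?<!\d)(?:\d{3}\.\d{3}\.\d{3}-\d{2}|\d{11})(?!\d)")

# Formatted (12.345.678/0001-00) or unformatted (12345678000100) CNPJ.
CNPJ_PATTERN = re.compile(r"(?<!\d)(?:\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}|\d{14})(?!\d)")


def find_candidates(text):
    """Return candidate CPF and CNPJ strings in the order they appear."""
    return {
        "cpfs": CPF_PATTERN.findall(text),
        "cnpjs": CNPJ_PATTERN.findall(text),
    }


sample = "CPF 123.456.789-00 or 12345678900; CNPJ 12.345.678/0001-00 or 12345678000100"
print(find_candidates(sample))
```

Matches found this way are only candidates; they would still need to pass check-digit validation before being reported.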

Conflict Detection

The node can also raise a Conflict alert. It marks a document as conflicting if:

  1. It finds more than one distinct CPF in a document that should belong to a single person.
  2. It finds inconsistent formatting.

This is useful for automated document screening.
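
The two rules above could be sketched as follows. This is a hypothetical illustration, not the node's logic: detect_conflicts is an invented name, and "inconsistent formatting" is interpreted here, as an assumption, to mean that punctuated and digits-only CPFs appear in the same file. The "Inconsistent formatting" message is likewise invented; only "Multiple CPFs found" appears in the output example on this page.

```python
import re


def detect_conflicts(cpfs):
    """Return (conflict, conflict_list) for the CPF strings found in one file."""
    reasons = []

    # Rule 1: compare on digits only, so the same number written in two
    # styles is not counted as two different CPFs.
    distinct_numbers = {re.sub(r"\D", "", cpf) for cpf in cpfs}
    if len(distinct_numbers) > 1:
        reasons.append("Multiple CPFs found")

    # Rule 2 (assumed interpretation): both punctuated and digits-only
    # styles occur in the same file.
    styles = {cpf.isdigit() for cpf in cpfs}
    if len(styles) > 1:
        reasons.append("Inconsistent formatting")

    return bool(reasons), reasons


print(detect_conflicts(["111.222.333-44", "999.888.777-66"]))
```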


Output Variables

The node outputs a list named results, with one entry per file containing the data found in that file.

Output Example (JSON)

[
  {
    "filename": "contract_joao.pdf",
    "cpfs": ["123.456.789-00"],
    "cnpjs": [],
    "conflict": false
  },
  {
    "filename": "strange_document.pdf",
    "cpfs": ["111.222.333-44", "999.888.777-66"],
    "conflict": true,
    "conflict_list": ["Multiple CPFs found"]
  }
]
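
A downstream step might use this output to route conflicting files to manual review. The sketch below assumes the results arrive as JSON text shaped like the example above; files_needing_review is a hypothetical helper, and keys are read with .get() because not every entry in the example carries every key.

```python
import json

results_json = """
[
  {"filename": "contract_joao.pdf", "cpfs": ["123.456.789-00"],
   "cnpjs": [], "conflict": false},
  {"filename": "strange_document.pdf",
   "cpfs": ["111.222.333-44", "999.888.777-66"],
   "conflict": true, "conflict_list": ["Multiple CPFs found"]}
]
"""


def files_needing_review(results):
    """Return (filename, reasons) pairs for entries flagged as conflicting."""
    return [
        (item["filename"], item.get("conflict_list", []))
        for item in results
        if item.get("conflict", False)
    ]


results = json.loads(results_json)
for filename, reasons in files_needing_review(results):
    print(f"{filename}: {', '.join(reasons)}")
```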