File Extractor Node

The File Extractor Node processes raw documents (such as PDFs, spreadsheets, and images) and extracts their textual content. It functions as a "translator," converting binary files into plain text that can be read and analyzed by Large Language Models (LLMs).

Configuration

This node has a simplified configuration, focused solely on identifying which files should be processed.

Input

You need to select the variable that contains the list of files. Typically, these files come from the Input node.

  • Field: "Input".
  • What to select: Look for the variable of type Files defined at the beginning of your flow (e.g., {{ input.uploaded_documents }}).
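To make the idea of a variable reference concrete, here is a minimal sketch of how an expression such as {{ input.uploaded_documents }} could be looked up. It is only an illustration: the `context` dictionary, the `resolve_reference` helper, and the sample file names are hypothetical and are not part of the platform, which resolves references for you.

```python
import re
from typing import Any, Dict

# Hypothetical flow context: node name -> that node's output variables.
context: Dict[str, Any] = {
    "input": {"uploaded_documents": ["resume.pdf", "contract.docx"]},
}

# Matches a single "{{ node.variable }}" expression.
REFERENCE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve_reference(expression: str, ctx: Dict[str, Any]) -> Any:
    """Resolve a '{{ node.variable }}' expression against a nested dict."""
    match = REFERENCE.fullmatch(expression.strip())
    if match is None:
        raise ValueError(f"not a variable reference: {expression!r}")
    value: Any = ctx
    for key in match.group(1).split("."):
        value = value[key]  # raises KeyError if the node or variable is missing
    return value


files = resolve_reference("{{ input.uploaded_documents }}", context)
print(files)
```

The point of the sketch is that the File Extractor receives the whole list held by the variable, not a single file.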

Supported Formats

The platform supports automatic extraction from a wide variety of formats:

| Category | Supported Extensions | Notes |
|---|---|---|
| Documents | .pdf, .docx, .doc, .txt, .rtf | Extraction of structured text. |
| Spreadsheets | .xlsx, .xls, .csv | Converts tables into readable text. |
| Presentations | .pptx, .ppt | Extracts text from slides. |
| Images | .png, .jpg, .jpeg, .tiff | Uses OCR (Optical Character Recognition) to read text within the image. |
| Others | .html, .xml, .json | - |
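If you want to check files before they reach the flow, the list above can be expressed as a simple lookup. The following is a rough sketch, not the platform's implementation: `SUPPORTED_FORMATS`, `category_for`, and `needs_ocr` are hypothetical names, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported formats list, grouped by category.
SUPPORTED_FORMATS = {
    "Documents": {".pdf", ".docx", ".doc", ".txt", ".rtf"},
    "Spreadsheets": {".xlsx", ".xls", ".csv"},
    "Presentations": {".pptx", ".ppt"},
    "Images": {".png", ".jpg", ".jpeg", ".tiff"},
    "Others": {".html", ".xml", ".json"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the category for a file name, or None if it is not listed."""
    # Assumption: extensions are compared case-insensitively.
    suffix = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return category
    return None


def needs_ocr(filename: str) -> bool:
    """Per the list above, only image formats go through OCR."""
    return category_for(filename) == "Images"
```

A pre-check like this lets you reject an unsupported upload with a clear message before the extractor runs.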

Output Variables

After processing, this node generates a structured output ready to be sent to an LLM.

  • contents: A list containing the text extracted from each file.

How to use it in an LLM?

In the next node (usually an LLM), you can reference the extracted content like this:

Analyze the following documents and provide a summary:
{{ file_extractor.contents }}
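Because contents is a list with one entry per file, it can help to picture how several extracted texts might end up in a single prompt. The sketch below is an assumption about one reasonable layout, not a description of how the platform renders the variable; `render_prompt` and the "--- Document N ---" separators are hypothetical.

```python
from typing import List


def render_prompt(instruction: str, contents: List[str]) -> str:
    """Build one prompt from an instruction and a list of extracted texts."""
    # Label each text so the model can tell the documents apart.
    sections = [
        f"--- Document {index} ---\n{text.strip()}"
        for index, text in enumerate(contents, start=1)
    ]
    return instruction + "\n\n" + "\n\n".join(sections)


prompt = render_prompt(
    "Analyze the following documents and provide a summary:",
    ["First extracted text.\n", "  Second extracted text."],
)
print(prompt)
```

Labeling each document is a design choice that makes it easier to ask the LLM for per-document answers.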

Example Flow

A common use case is creating an assistant that analyzes resumes or contracts:

  • Input Node: Defines a field of type "Files" (named files in this example).

  • File Extractor: Receives {{ input.files }} as its Input.

  • LLM: Receives {{ file_extractor.contents }} with the instruction: "Extract the name and expiration date from these documents".

  • Output: Returns the structured data.
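The four steps above can be sketched as a small pipeline. This is only a rough illustration under stated assumptions: `file_extractor` here is a stand-in that decodes UTF-8 text (the real node handles PDFs, spreadsheets, OCR, and so on), `run_flow` and `stub_llm` are hypothetical, and the sample file contents are invented for the example.

```python
from typing import Callable, Dict, List


def file_extractor(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Stand-in for the node: returns one extracted text per file."""
    # Assumption: every file is plain UTF-8 text. The real node does far more.
    texts = [data.decode("utf-8", errors="replace") for data in files.values()]
    return {"contents": texts}


def run_flow(files: Dict[str, bytes], llm: Callable[[str], str]) -> str:
    """Input -> File Extractor -> LLM -> Output."""
    extracted = file_extractor(files)
    prompt = (
        "Extract the name and expiration date from these documents:\n\n"
        + "\n\n".join(extracted["contents"])
    )
    return llm(prompt)


def stub_llm(prompt: str) -> str:
    """Placeholder model call so the sketch runs without any external service."""
    return f"[stub response to a prompt of {len(prompt)} characters]"


# Invented sample input, used only to exercise the sketch.
sample_files = {
    "contract_a.txt": b"Party: Example Corp. Expires: 2030-01-01.",
    "contract_b.txt": b"Party: Sample LLC. Expires: 2031-06-30.",
}
result = run_flow(sample_files, stub_llm)
print(result)
```

Passing the model as a function keeps the sketch independent of any particular LLM provider.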