Process Files | Documentation

This page shows you how to process files you have uploaded to your Catalog into a unified AI- and RAG-ready format.

Processing uses three preset Pipelines: Transformation, Splitting, and Embedding.

Transformation: Parses different file types (PDF, DOCX, PPTX, XLSX, CSV, and others) into a single source of truth, represented as high-quality structured Markdown text. Process:
- Plain Text Files (.txt, .md): Text is extracted directly.
- Other File Types: Parsed into Markdown for a unified textual representation.

Splitting: Breaks the single source of truth into smaller chunks to improve search efficiency and to fit within embedding models' context windows. Process:
- Markdown Text: Uses headings to determine splitting points.
- Plain Text Files: Uses a recursive strategy to segment text that has no explicit headings.

Embedding: Converts chunks into vector representations using an embedding model and stores them as part of the Catalog. Process:
- The chunks from the Splitting step are converted into vectors using an embedding model.
- The vectors are stored in the Catalog for low-latency retrieval.
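
To make the two Splitting strategies concrete, here is a minimal Python sketch. It is not the implementation used by the preset Pipelines; the function names, the separator order, and the `max_chars` limit are assumptions chosen for illustration. Heading-based splitting starts a new chunk at every Markdown heading, while the recursive strategy splits on progressively finer separators until each piece fits.

```python
import re


def split_markdown_by_headings(markdown: str) -> list[str]:
    """Start a new chunk at every Markdown heading line (illustrative only)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # An ATX heading is 1-6 '#' characters followed by a space.
        # Simplification: headings inside fenced code blocks are not excluded.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


def split_recursively(text: str, max_chars: int,
                      separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on progressively finer separators until every piece fits max_chars."""
    if len(text) <= max_chars:
        return [text] if text else []
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(split_recursively(part, max_chars, separators))
            return pieces
    # No separator left to try: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

A production splitter would typically also merge small neighbouring pieces back together up to the size limit; this sketch omits that step to keep the recursion easy to follow.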

#Process Files via API

You can process files in your Catalog by making a POST request to the processAsync endpoint.

cURL

export INSTILL_API_TOKEN=********
curl -X POST 'HOST_URL/v1alpha/catalogs/files/processAsync' \
--header "Authorization: Bearer $INSTILL_API_TOKEN" \
--header "Content-Type: application/json" \
--data-raw '{
    "fileUids": ["fileUid1", "fileUid2"]
}'

INFO: If you are using Instill Core as a managed service, set HOST_URL to https://private.instill-ai.com. If you are self-hosting Instill Core, use http://localhost:8080.
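
The same request can be sketched in Python using only the standard library. The endpoint path, headers, and body mirror the cURL command above; the helper names `build_process_request` and `process_files` are hypothetical and not part of any Instill SDK.

```python
import json
import urllib.request


def build_process_request(host_url: str, api_token: str,
                          file_uids: list[str]) -> urllib.request.Request:
    """Build the POST request for the processAsync endpoint without sending it."""
    body = json.dumps({"fileUids": file_uids}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host_url.rstrip('/')}/v1alpha/catalogs/files/processAsync",
        data=body,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def process_files(host_url: str, api_token: str, file_uids: list[str]) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_process_request(host_url, api_token, file_uids)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For a self-hosted deployment, a call would look like `process_files("http://localhost:8080", os.environ["INSTILL_API_TOKEN"], ["fileUid1", "fileUid2"])` after `import os`. Separating request construction from sending makes it possible to inspect the URL, headers, and body without touching the network.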

#Body Parameters

fileUids (array of strings, required): An array of file UIDs that you want to process.

Notes:

- fileUids contains the unique identifiers (UIDs) of the files to process. You obtain each fileUid when you upload the file to the Catalog.
- File processing is asynchronous. To check the processing status, retrieve the file information from the Catalog.
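
Because processing is asynchronous, a client usually polls until a file reaches a final status. The sketch below shows one way to do that in Python. How the status is fetched is deliberately left to the caller: `fetch_status` is a hypothetical callable that you would implement against the Catalog's file information endpoint, which this page does not specify. Treating COMPLETED and FAILED as the only final statuses is also an assumption.

```python
import time
from typing import Callable

# Assumption: these two statuses mean that processing has finished.
TERMINAL_STATUSES = {
    "FILE_PROCESS_STATUS_COMPLETED",
    "FILE_PROCESS_STATUS_FAILED",
}


def wait_for_processing(
    fetch_status: Callable[[str], str],
    file_uid: str,
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status(file_uid) until a terminal status or the timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(file_uid)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{file_uid} still in {status} after {timeout_seconds}s")
        sleep(interval_seconds)
```

The `sleep` parameter exists so that the loop can be exercised in tests without real delays; in normal use the default `time.sleep` applies.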

#Example Response

A successful response returns a JSON object containing the list of files that are being processed.

{
  "files": [
    {
      "fileUid": "fileUid1",
      "name": "example.pdf",
      "type": "FILE_TYPE_PDF",
      "processStatus": "FILE_PROCESS_STATUS_WAITING",
      "size": "102400"
    },
    {
      "fileUid": "fileUid2",
      "name": "document.txt",
      "type": "FILE_TYPE_TEXT",
      "processStatus": "FILE_PROCESS_STATUS_WAITING",
      "size": "20480"
    }
  ]
}

#Output Description

files: An array of file objects that are being processed.
- fileUid (string): The unique identifier of the file.
- name (string): The name of the file.
- type (string): The type of the file (e.g., FILE_TYPE_PDF, FILE_TYPE_TEXT).
- processStatus (string): The current processing status of the file. Possible values include:
  - FILE_PROCESS_STATUS_NOTSTARTED
  - FILE_PROCESS_STATUS_WAITING
  - FILE_PROCESS_STATUS_CONVERTING
  - FILE_PROCESS_STATUS_CHUNKING
  - FILE_PROCESS_STATUS_EMBEDDING
  - FILE_PROCESS_STATUS_COMPLETED
  - FILE_PROCESS_STATUS_FAILED
- size (string): The size of the file in bytes.
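
As an illustration of the fields above, the following Python sketch parses a response into typed objects. The class and function names are hypothetical. Note that size arrives as a string and is converted to an integer here, and that the enum lists only the status values named on this page, so an unlisted value would raise a ValueError.

```python
import json
from dataclasses import dataclass
from enum import Enum


class ProcessStatus(str, Enum):
    NOTSTARTED = "FILE_PROCESS_STATUS_NOTSTARTED"
    WAITING = "FILE_PROCESS_STATUS_WAITING"
    CONVERTING = "FILE_PROCESS_STATUS_CONVERTING"
    CHUNKING = "FILE_PROCESS_STATUS_CHUNKING"
    EMBEDDING = "FILE_PROCESS_STATUS_EMBEDDING"
    COMPLETED = "FILE_PROCESS_STATUS_COMPLETED"
    FAILED = "FILE_PROCESS_STATUS_FAILED"


@dataclass(frozen=True)
class ProcessedFile:
    file_uid: str
    name: str
    type: str
    process_status: ProcessStatus
    size_bytes: int


def parse_process_response(payload: str) -> list[ProcessedFile]:
    """Convert the JSON response body into a list of ProcessedFile objects."""
    data = json.loads(payload)
    return [
        ProcessedFile(
            file_uid=item["fileUid"],
            name=item["name"],
            type=item["type"],
            # Raises ValueError for a status not listed in ProcessStatus.
            process_status=ProcessStatus(item["processStatus"]),
            # The API returns size as a string of bytes.
            size_bytes=int(item["size"]),
        )
        for item in data.get("files", [])
    ]
```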

#Process Files via Console

To process files from Console, follow these steps:

1. Launch Console via a local Instill Core deployment at http://localhost:3000, or by selecting the Go to console button in the bottom-left of the Instill Agent interface if you are using Instill Core as a managed service.
2. Navigate to the Artifacts page using the navigation bar.
3. Ensure that you have followed the steps in the Upload Files page.
4. Click the Process Files button.

The processing status of your files appears in the Files tab. When the status is Completed, you can view your Files and Chunks, and also use the Retrieve Chunks API.

Note for Instill Core Users: Ensure that you have set a valid OpenAI API key in your environment configuration; the Embedding stage of processing requires it. See the Configuration page for more details.