This page shows you how to process files you have uploaded to your Catalog into a unified AI- and RAG-ready format.
It uses three preset Pipelines: Transformation, Splitting, and Embedding.
- Transformation: Parses different file types (PDF, DOCX, PPTX, XLSX, CSV, and others) into a single source of truth, represented as high-quality structured Markdown text. Process:
  - Plain Text Files (.txt, .md): Text is extracted directly.
  - Other File Types: Parsed into Markdown for a unified textual representation.
- Splitting: Breaks the single source of truth into smaller chunks, which improves search efficiency and keeps each chunk within the embedding model's context window. Process:
  - Markdown Text: Uses headings to determine splitting points.
  - Plain Text Files: Uses a recursive segmentation strategy, since there are no explicit headings.
- Embedding: Converts chunks into vector representations using an embedding model and stores them as part of the Catalog. Process:
  - The chunks obtained from the Splitting step are transformed into vector representations using an embedding model.
  - These vectors are stored in the Catalog for low-latency retrieval.
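To make the two Splitting strategies concrete, here is a minimal Python sketch. It is an illustration only, not the implementation used by the preset Pipelines: the function names, the `max_chars` limit, and the separator order are all assumptions chosen for the example.

```python
import re


def split_markdown_by_headings(text: str) -> list[str]:
    """Start a new chunk at every Markdown heading line (#, ##, ...)."""
    chunks, current = [], []
    for line in text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def split_recursively(text: str, max_chars: int = 200,
                      separators=("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks, buffer = [], ""
            for part in text.split(sep):
                candidate = part if not buffer else buffer + sep + part
                if len(candidate) <= max_chars:
                    buffer = candidate
                else:
                    if buffer:
                        chunks.extend(split_recursively(buffer, max_chars, separators))
                    buffer = part
            if buffer:
                chunks.extend(split_recursively(buffer, max_chars, separators))
            return chunks
    # No separator left: fall back to fixed-width slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


markdown = "# Title\nIntro text.\n## Section A\nBody A.\n## Section B\nBody B."
print(split_markdown_by_headings(markdown))
```

The heading-based splitter keeps each section together with its heading, while the recursive splitter only falls back to finer separators when a piece is still too long.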
#Process Files via API
You can process files in your Catalog by making a POST request to the `processAsync` endpoint.
If you are using Instill Core as a managed service, set `HOST_URL` to `https://private.instill-ai.com`. If you are self-hosting Instill Core, use `http://localhost:8080`.
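As a rough sketch, the request could be assembled in Python as shown below. Two things here are assumptions that this page does not state: the URL path in `PROCESS_PATH` and the `Authorization: Bearer` header. Check both against the API reference for your deployment. The helper name `build_process_request` is hypothetical.

```python
import json
import urllib.request

# Assumption: this path is NOT given on this page; confirm it in the API reference.
PROCESS_PATH = "/v1alpha/catalogs/files/processAsync"


def build_process_request(host_url: str, api_token: str, file_uids: list[str],
                          path: str = PROCESS_PATH) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts file processing."""
    body = json.dumps({"fileUids": list(file_uids)}).encode("utf-8")
    return urllib.request.Request(
        url=host_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {api_token}",
        },
    )


request = build_process_request(
    "http://localhost:8080",  # self-hosted HOST_URL from this page
    "YOUR_API_TOKEN",         # placeholder, replace with a real token
    ["fileUid1", "fileUid2"],
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before anything reaches the server.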
#Body Parameters
- `fileUids` (array of strings, required): An array of file UIDs that you want to process.

Notes:
- The `fileUids` field in the request body contains an array of strings representing the unique identifiers (UIDs) of the files to be processed. You can obtain the `fileUid` when you upload files to the Catalog.
- The processing of files is asynchronous. You can check the processing status by retrieving the file information from the Catalog.
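Because processing is asynchronous, a client typically polls until a file reaches a final state. The sketch below shows one way to do that in Python. The `fetch_status` callable is hypothetical: it stands in for whatever call you use to retrieve a file's `processStatus` from the Catalog. Treating `FILE_PROCESS_STATUS_COMPLETED` and `FILE_PROCESS_STATUS_FAILED` as the only terminal states is also an assumption.

```python
import time
from typing import Callable

# Assumption: these two statuses are the terminal ones.
TERMINAL_STATUSES = {
    "FILE_PROCESS_STATUS_COMPLETED",
    "FILE_PROCESS_STATUS_FAILED",
}


def wait_for_processing(fetch_status: Callable[[str], str], file_uid: str,
                        interval_s: float = 2.0, max_attempts: int = 30) -> str:
    """Poll fetch_status(file_uid) until a terminal status or the attempt limit."""
    for _ in range(max_attempts):
        status = fetch_status(file_uid)
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"{file_uid} did not finish after {max_attempts} checks")


# Demo with a fake fetcher that replays a fixed sequence of statuses.
fake_statuses = iter([
    "FILE_PROCESS_STATUS_WAITING",
    "FILE_PROCESS_STATUS_CONVERTING",
    "FILE_PROCESS_STATUS_COMPLETED",
])
print(wait_for_processing(lambda uid: next(fake_statuses), "fileUid1", interval_s=0))
```

Passing the fetcher in as a parameter keeps the polling logic independent of any particular HTTP client or endpoint.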
#Example Response
A successful response will return a JSON object containing the list of files that are being processed.
{
  "files": [
    {
      "fileUid": "fileUid1",
      "name": "example.pdf",
      "type": "FILE_TYPE_PDF",
      "processStatus": "FILE_PROCESS_STATUS_WAITING",
      "size": "102400"
    },
    {
      "fileUid": "fileUid2",
      "name": "document.txt",
      "type": "FILE_TYPE_TEXT",
      "processStatus": "FILE_PROCESS_STATUS_WAITING",
      "size": "20480"
    }
  ]
}
#Output Description
- `files`: An array of file objects that are being processed.
  - `fileUid` (string): The unique identifier of the file.
  - `name` (string): The name of the file.
  - `type` (string): The type of the file (e.g., `FILE_TYPE_PDF`, `FILE_TYPE_TEXT`).
  - `processStatus` (string): The current processing status of the file. Possible values include:
    - `FILE_PROCESS_STATUS_NOTSTARTED`
    - `FILE_PROCESS_STATUS_WAITING`
    - `FILE_PROCESS_STATUS_CONVERTING`
    - `FILE_PROCESS_STATUS_CHUNKING`
    - `FILE_PROCESS_STATUS_EMBEDDING`
    - `FILE_PROCESS_STATUS_COMPLETED`
    - `FILE_PROCESS_STATUS_FAILED`
  - `size` (string): The size of the file in bytes.
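The response fields described above can be consumed with a few lines of Python. This sketch groups the returned files by `processStatus` and converts `size` to an integer, since it arrives as a string. The function name `group_files_by_status` and the `sizeBytes` key are illustrative choices, not part of the API.

```python
import json
from collections import defaultdict


def group_files_by_status(response_text: str) -> dict[str, list[dict]]:
    """Group files in a processAsync response by their processStatus."""
    payload = json.loads(response_text)
    grouped = defaultdict(list)
    for file_info in payload.get("files", []):
        grouped[file_info["processStatus"]].append({
            "fileUid": file_info["fileUid"],
            "name": file_info["name"],
            # size is serialized as a string, so convert it for arithmetic.
            "sizeBytes": int(file_info["size"]),
        })
    return dict(grouped)


sample_response = json.dumps({
    "files": [
        {"fileUid": "fileUid1", "name": "example.pdf", "type": "FILE_TYPE_PDF",
         "processStatus": "FILE_PROCESS_STATUS_WAITING", "size": "102400"},
        {"fileUid": "fileUid2", "name": "document.txt", "type": "FILE_TYPE_TEXT",
         "processStatus": "FILE_PROCESS_STATUS_WAITING", "size": "20480"},
    ]
})

for status, files in group_files_by_status(sample_response).items():
    print(status, [f["name"] for f in files])
```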
#Process Files via Console
To process files from Console, follow these steps:
- Launch Console via a local Instill Core deployment at http://localhost:3000, or by selecting the `Go to console` button in the bottom-left of the Instill Agent interface if you are using Instill Core as a managed service.
- Navigate to the Artifacts page using the navigation bar.
- Ensure that you have followed the steps in the Upload Files page.
- Click the `Process Files` button.
The processing status of your files appears in the Files tab. When the status is `Completed`, you can view your Files and Chunks, and also use the Retrieve Chunks API.
Note for Instill Core Users: Ensure that you have set up a valid OpenAI API key in your environment configuration to enable the Embedding stage of processing. See the Configuration page for more details.