Version: v0.1.0

File Upload

File Upload indexes document files, such as PDF and DOCX, into a Knowledge. Uploaded documents are parsed, chunked, and embedded automatically, which turns them into searchable knowledge data.

Supported File Formats

Format	Extension	Max Size
PDF	`.pdf`	100MB
Word	`.docx`, `.doc`	100MB
PowerPoint	`.pptx`	100MB
Excel	`.xlsx`	100MB
CSV	`.csv`	100MB
HTML	`.html`	100MB
Text	`.txt`	100MB
Markdown	`.md`	100MB
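The format table above can be expressed as a small pre-upload check. This is a minimal sketch, not the product's actual validation: the function name `check_upload` is hypothetical, and it assumes "100MB" means 100 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Extensions and the 100MB limit come from the table above.
# Assumption: "100MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".csv", ".html", ".txt", ".md",
}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()  # case-insensitive match on the extension
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append(f"file exceeds {MAX_SIZE_BYTES} bytes")
    return problems
```

Running a check like this on the client side before upload gives the same outcome as the drop zone's format filtering, but with an explicit reason for each rejected file.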

Upload Screen

To open the file upload screen, click the Upload Files button on the Documents tab of the Knowledge detail page. The screen is served at the /knowledge/:knowledgeId/documents/files/upload path.

The upload consists of a 3-step wizard:

Step	Name	Description
Step 1	Upload Files	File selection and parsing option configuration
Step 2	Chunk Settings	Chunking strategy and parameter configuration
Step 3	Review & Submit	Review indexing options and final submission

Step 1: File Selection

Drag and drop files onto the drop zone or click to select files.

Multi-file upload: You can select or drag multiple files at once
File removal: Individual files can be removed from the added file list
Format filtering: Files with unsupported formats are automatically filtered out

At this step, configure Title Override, Description, and parsing options for the document.

Processing Options

Parsing Options (Step 1)

Parsing options control how text is extracted from documents.

Option	Description	Default	Notes
Table Structure Extraction	Extract tables from documents in structured format	Enabled	Switch on/off
Table Extraction Mode	Table extraction accuracy setting	`FAST`	`FAST` or `ACCURATE`
Use OCR	Enable text recognition in images	Disabled	Switch on/off
OCR Engine	Select the OCR engine to use	`AUTO`	See table below
Document Timeout	Maximum wait time for parsing (seconds)	`90`	Min 10 ~ Max 600

OCR Engine Options

Engine	Description
`AUTO`	System automatically selects the optimal engine
`TESSERACT`	Google Tesseract OCR
`EASYOCR`	EasyOCR (multi-language support)
`RAPIDOCR`	RapidOCR (lightweight high-speed engine)
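The parsing options and their defaults can be modeled as a small settings object. The sketch below is illustrative only: apart from `document_timeout`, which this page names, the class and field names are hypothetical and do not reflect a confirmed API schema.

```python
from dataclasses import dataclass

# Allowed values taken from the tables above.
TABLE_MODES = {"FAST", "ACCURATE"}
OCR_ENGINES = {"AUTO", "TESSERACT", "EASYOCR", "RAPIDOCR"}


@dataclass
class ParsingOptions:
    # Hypothetical field names; defaults mirror the documented defaults.
    table_structure: bool = True
    table_mode: str = "FAST"
    use_ocr: bool = False
    ocr_engine: str = "AUTO"
    document_timeout: int = 90  # seconds, documented range 10 to 600

    def validate(self) -> None:
        """Raise ValueError if any option falls outside the documented values."""
        if self.table_mode not in TABLE_MODES:
            raise ValueError(f"unknown table mode: {self.table_mode}")
        if self.ocr_engine not in OCR_ENGINES:
            raise ValueError(f"unknown OCR engine: {self.ocr_engine}")
        if not 10 <= self.document_timeout <= 600:
            raise ValueError("document_timeout must be between 10 and 600 seconds")
```

For example, `ParsingOptions(table_mode="ACCURATE", use_ocr=True, document_timeout=300).validate()` passes, while a timeout of 5 raises `ValueError`.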

Chunking Options (Step 2)

Chunking options determine how extracted text is split into search units (chunks).

Option	Description	Default	Range
Chunking Strategy	Text splitting algorithm	`hybrid`	See table below
Max Chunk Length	Maximum tokens per chunk	`500`	100 ~ 4,000
Overlap Length	Number of overlapping tokens between adjacent chunks	`50`	0 ~ 500

Chunking Strategies

Strategy	Description	Recommended Use
Hybrid	Combined structure + size-based splitting	General purpose (default, recommended)
Markdown	Splits by markdown headings	`.md` files, structured documents
Hierarchical	Splits based on document hierarchy	Long documents with clear sections
Fixed Size	Uniform splitting by fixed size	Unstructured text
Parent-Child	Hierarchical splitting with parent-child relationships	Documents requiring detailed search
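To make Max Chunk Length and Overlap Length concrete, here is a minimal sketch of the Fixed Size strategy. It is not the product's implementation: it assumes whitespace-separated words stand in for tokens (the real tokenizer is not described here), and the function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, max_len: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_len tokens, sharing `overlap` tokens."""
    # Ranges come from the Chunking Options table.
    if not 100 <= max_len <= 4000:
        raise ValueError("max_len must be between 100 and 4000")
    if not 0 <= overlap <= 500:
        raise ValueError("overlap must be between 0 and 500")
    # Assumption: the overlap must be smaller than the chunk, or the window never advances.
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")

    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

With the defaults, a 1,000-token text yields three chunks starting at tokens 0, 450, and 900, and each adjacent pair shares 50 tokens.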

Chunk Preview

In Step 2, the Chunk Preview panel shows the chunking results for the current settings in real time. Use it to verify chunk sizes and splitting quality before indexing.

Indexing Options (Step 3)

Indexing options determine where chunks are stored and which search mode is used.

Option	Description	Default
Search Mode	Search method used for indexing	`HYBRID`

Search Modes

Mode	Description
Semantic Only	Stored in Vector DB only, semantic similarity search
Keyword Only	Stored in Text search DB only, BM25 keyword search
Hybrid (Recommended)	Stored in both Vector + Text, combining both search methods

Search Mode and Storage

The search mode is linked to the Knowledge's storage target (configured in Settings). For example, if only Vector DB is enabled for the Knowledge, data will not be stored in Text DB even if Keyword Only mode is selected.
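The interaction between search mode and storage targets can be sketched as a set intersection. This is one plausible reading of the rule above, not confirmed behavior: only `HYBRID` appears as an identifier on this page, so the `SEMANTIC_ONLY` and `KEYWORD_ONLY` keys, the store names, and the function name are hypothetical.

```python
# Stores each search mode writes to, per the Search Modes table.
# "HYBRID" is the documented default; the other two keys are hypothetical spellings.
MODE_TARGETS = {
    "SEMANTIC_ONLY": {"vector"},
    "KEYWORD_ONLY": {"text"},
    "HYBRID": {"vector", "text"},
}


def effective_stores(mode: str, enabled_stores: set[str]) -> set[str]:
    """Return the stores that actually receive data for this mode and Knowledge."""
    # Data lands only in stores that the mode targets AND the Knowledge has enabled.
    return MODE_TARGETS[mode] & enabled_stores
```

Under this reading, `effective_stores("KEYWORD_ONLY", {"vector"})` returns an empty set, which matches the example above: nothing is written to Text DB when only Vector DB is enabled.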

Post-Upload Status Monitoring

After submitting files, documents automatically go through the processing pipeline. You can check the current status of each document in the document list.

Document Status

Status	Color	Description
Ready	Blue	Waiting for indexing
Processing	Blue (animated)	Parsing/chunking/indexing in progress
Completed	Green	Indexing complete
Paused	Yellow	Paused
Failed	Red	Processing failed
Cancelled	Gray	Cancelled by user
Reindexing	Blue (animated)	Reindexing in progress

Documents in active states (Ready, Processing, Reindexing) are polled automatically every 5 seconds so that the list always shows their latest status.
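A client that tracks a document outside the UI could follow the same polling pattern. The sketch below assumes a hypothetical `fetch_status` callable that returns the current status string; the `max_polls` cap is also an assumption, added so the loop cannot run forever.

```python
import time
from typing import Callable

# Active states and the 5-second interval come from the description above.
ACTIVE_STATES = {"Ready", "Processing", "Reindexing"}
POLL_INTERVAL_SECONDS = 5


def wait_until_settled(
    fetch_status: Callable[[], str],  # hypothetical: returns the document's status
    interval: float = POLL_INTERVAL_SECONDS,
    max_polls: int = 120,  # assumption: safety cap, not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the document leaves the active states or max_polls is reached."""
    status = fetch_status()
    polls = 1
    while status in ACTIVE_STATES and polls < max_polls:
        sleep(interval)
        status = fetch_status()
        polls += 1
    return status
```

Passing `sleep` as a parameter keeps the loop testable: a test can substitute a recorder for `time.sleep` instead of actually waiting.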

Actions by Status

Processing: Can be stopped (Stop)
Paused: Can be resumed (Resume) or cancelled (Cancel)
Completed: Can be reindexed (Reindex)
Failed: Can view the error message and retry
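The status-to-action rules above map naturally to a lookup table. This is an illustrative sketch: the `Retry` label and the function name are hypothetical, and returning no actions for statuses that the list does not mention (Ready, Cancelled, Reindexing) is an assumption.

```python
# Actions per status, as listed above. "Retry" is a hypothetical label for
# the retry action on failed documents.
ACTIONS_BY_STATUS = {
    "Processing": ["Stop"],
    "Paused": ["Resume", "Cancel"],
    "Completed": ["Reindex"],
    "Failed": ["Retry"],
}


def available_actions(status: str) -> list[str]:
    """Return the actions offered for a status; unlisted statuses get none."""
    return list(ACTIONS_BY_STATUS.get(status, []))
```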

Document Detail Page

Click a completed document to navigate to the detail page at /knowledge/:knowledgeId/documents/files/:docId. On this page you can view:

Indexing Status Card: Current processing status and progress information
Chunk List: Browse generated chunks with pagination
File Download: Download the original file
Reindexing: Reindex with modified options

Tip

Precise mode (ACCURATE table extraction + OCR enabled) takes longer to process but produces more accurate results for PDFs with complex layouts. Default settings are sufficient for simple text documents.

Large Files

For large files (50MB or more), set `document_timeout` higher than the 90-second default, up to the 600-second maximum. Parsing may fail if the timeout is exceeded.

Next Steps

Manual Documents - Write and register text chunks directly
Web Crawling - URL-based automatic web page collection
Chunking and Options - Detailed guide on chunking strategies and advanced options
Search Test - Verify search quality of indexed documents
