File Upload
File Upload indexes document files such as PDF and DOCX into a Knowledge. Uploaded documents are automatically converted into searchable knowledge data through parsing, chunking, and embedding.
Supported File Formats
| Format | Extension | Max Size |
|---|---|---|
| PDF | .pdf | 100MB |
| Word | .docx, .doc | 100MB |
| PowerPoint | .pptx | 100MB |
| Excel | .xlsx | 100MB |
| CSV | .csv | 100MB |
| HTML | .html | 100MB |
| Text | .txt | 100MB |
| Markdown | .md | 100MB |
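The limits in the table can be checked on the client before a file is added to the upload list. The following is a minimal sketch, not the product's actual validation: the function name `check_file` is hypothetical, extension matching is assumed to be case-insensitive, and "100MB" is assumed to mean 100 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Extensions taken from the Supported File Formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".csv", ".html", ".txt", ".md",
}

# Assumption: "100MB" is binary megabytes; the product may count 10**6 bytes per MB.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def check_file(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, reason) for a candidate upload (hypothetical helper)."""
    # Assumption: extensions are compared case-insensitively.
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported extension: {ext or '(none)'}"
    if size_bytes > MAX_SIZE_BYTES:
        return False, "file exceeds the 100MB limit"
    return True, "ok"
```

For example, `check_file("report.PDF", 1024)` is accepted, while `check_file("archive.zip", 1024)` is rejected because of its extension.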
Upload Screen
The file upload screen is available at the /knowledge/:knowledgeId/documents/files/upload path. To open it, click the Upload Files button on the Documents tab of the Knowledge detail page.
The upload is a 3-step wizard:
| Step | Name | Description |
|---|---|---|
| Step 1 | Upload Files | File selection and parsing option configuration |
| Step 2 | Chunk Settings | Chunking strategy and parameter configuration |
| Step 3 | Review & Submit | Review indexing options and final submission |
Step 1: File Selection
Drag and drop files onto the drop zone or click to select files.
- Multi-file upload: You can select or drag multiple files at once
- File removal: Individual files can be removed from the added file list
- Format filtering: Files with unsupported formats are automatically filtered out
At this step, configure Title Override, Description, and parsing options for the document.
Processing Options
Parsing Options (Step 1)
Parsing options control how text is extracted from documents.
| Option | Description | Default | Notes |
|---|---|---|---|
| Table Structure Extraction | Extract tables from documents in structured format | Enabled | Switch on/off |
| Table Extraction Mode | Table extraction accuracy setting | FAST | FAST or ACCURATE |
| Use OCR | Enable text recognition in images | Disabled | Switch on/off |
| OCR Engine | Select the OCR engine to use | AUTO | See table below |
| Document Timeout | Maximum wait time for parsing (seconds) | 90 | Min 10 ~ Max 600 |
OCR Engine Options
| Engine | Description |
|---|---|
| AUTO | System automatically selects the optimal engine |
| TESSERACT | Google Tesseract OCR |
| EASYOCR | EasyOCR (multi-language support) |
| RAPIDOCR | RapidOCR (lightweight high-speed engine) |
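The parsing options and their defaults can be summarized as a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical (only `document_timeout` appears in this guide), and the validation rules simply mirror the tables above.

```python
from dataclasses import dataclass

# Allowed values as listed in the parsing option tables.
TABLE_MODES = {"FAST", "ACCURATE"}
OCR_ENGINES = {"AUTO", "TESSERACT", "EASYOCR", "RAPIDOCR"}


@dataclass
class ParsingOptions:
    # Hypothetical field names; defaults follow the Parsing Options table.
    table_structure: bool = True
    table_mode: str = "FAST"
    use_ocr: bool = False
    ocr_engine: str = "AUTO"
    document_timeout: int = 90  # seconds, allowed range 10 to 600

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the options are valid."""
        errors = []
        if self.table_mode not in TABLE_MODES:
            errors.append(f"unknown table mode: {self.table_mode}")
        if self.ocr_engine not in OCR_ENGINES:
            errors.append(f"unknown OCR engine: {self.ocr_engine}")
        if not 10 <= self.document_timeout <= 600:
            errors.append("document_timeout must be between 10 and 600 seconds")
        return errors
```

With the defaults, `ParsingOptions().validate()` returns an empty list; `ParsingOptions(document_timeout=5).validate()` reports the out-of-range timeout.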
Chunking Options (Step 2)
Chunking options determine how extracted text is split into search units (chunks).
| Option | Description | Default | Range |
|---|---|---|---|
| Chunking Strategy | Text splitting algorithm | hybrid | See table below |
| Max Chunk Length | Maximum tokens per chunk | 500 | 100 ~ 4,000 |
| Overlap Length | Number of overlapping tokens between adjacent chunks | 50 | 0 ~ 500 |
Chunking Strategies
| Strategy | Description | Recommended Use |
|---|---|---|
| Hybrid | Combined structure + size-based splitting | General purpose (default, recommended) |
| Markdown | Splits by markdown headings | .md files, structured documents |
| Hierarchical | Splits based on document hierarchy | Long documents with clear sections |
| Fixed Size | Uniform splitting by fixed size | Unstructured text |
| Parent-Child | Hierarchical splitting with parent-child relationships | Documents requiring detailed search |
In Step 2, the Chunk Preview panel shows the chunking results for the current settings in real time. Verify chunk sizes and splitting quality before indexing.
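To see how Max Chunk Length and Overlap Length interact, the Fixed Size strategy can be approximated locally. The sketch below is not the service's implementation: the function name is hypothetical, tokens are whatever list you pass in (for instance `text.split()`, which is only a rough stand-in for the real tokenizer), and rejecting an overlap that is not smaller than the chunk length is an assumption.

```python
def fixed_size_chunks(tokens: list, max_len: int = 500, overlap: int = 50) -> list[list]:
    """Split tokens into windows of at most max_len that share `overlap` tokens."""
    # Assumption: overlap must be smaller than max_len, otherwise the window cannot advance.
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap  # how far each new chunk starts after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

With the default settings, each new chunk starts 450 tokens after the previous one, so adjacent chunks share 50 tokens.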
Indexing Options (Step 3)
Indexing options configure how chunks are stored and which search mode is used.
| Option | Description | Default |
|---|---|---|
| Search Mode | Search method used for indexing | HYBRID |
Search Modes
| Mode | Description |
|---|---|
| Semantic Only | Stored in Vector DB only, semantic similarity search |
| Keyword Only | Stored in Text search DB only, BM25 keyword search |
| Hybrid (Recommended) | Stored in both Vector + Text, combining both search methods |
The search mode is linked to the Knowledge's storage target (configured in Settings). For example, if only Vector DB is enabled for the Knowledge, data will not be stored in Text DB even if Keyword Only mode is selected.
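One way to read this rule is as an intersection between the stores a search mode asks for and the stores enabled on the Knowledge. The sketch below illustrates that reading only; the identifiers `SEMANTIC_ONLY`, `KEYWORD_ONLY`, `"vector"`, and `"text"` are hypothetical, and the guide does not state what happens to the data when the intersection is empty.

```python
# Stores each search mode writes to, per the Search Modes table.
MODE_TARGETS = {
    "SEMANTIC_ONLY": {"vector"},
    "KEYWORD_ONLY": {"text"},
    "HYBRID": {"vector", "text"},
}


def effective_targets(search_mode: str, enabled_stores: set[str]) -> set[str]:
    """Stores that actually receive data (assumed: requested stores AND enabled stores)."""
    return MODE_TARGETS[search_mode] & enabled_stores
```

Under this reading, `effective_targets("KEYWORD_ONLY", {"vector"})` is empty, which matches the example above: nothing is written to the Text DB because the Knowledge has only the Vector DB enabled.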
Post-Upload Status Monitoring
After submitting files, documents automatically go through the processing pipeline. You can check the current status of each document in the document list.
Document Status
| Status | Color | Description |
|---|---|---|
| Ready | Blue | Waiting for indexing |
| Processing | Blue (animated) | Parsing/chunking/indexing in progress |
| Completed | Green | Indexing complete |
| Paused | Yellow | Paused |
| Failed | Red | Processing failed |
| Cancelled | Gray | Cancelled by user |
| Reindexing | Blue (animated) | Reindexing in progress |
Documents in active states (Ready, Processing, Reindexing) are automatically polled at 5-second intervals to refresh with the latest status.
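A client that tracks a document outside the UI could follow the same pattern. This is a hedged sketch rather than the product's code: `fetch_status` stands for whatever call returns the current status string, and `max_polls` is a hypothetical safety limit that the guide does not mention.

```python
import time
from typing import Callable

# Active states and interval as described above.
ACTIVE_STATES = {"Ready", "Processing", "Reindexing"}
POLL_INTERVAL_SECONDS = 5


def poll_until_settled(
    fetch_status: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 120,  # hypothetical guard so the loop cannot run forever
) -> str:
    """Poll every 5 seconds until the document leaves the active states."""
    status = fetch_status()
    polls = 1
    while status in ACTIVE_STATES and polls < max_polls:
        sleep(POLL_INTERVAL_SECONDS)
        status = fetch_status()
        polls += 1
    return status
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in normal use the default `time.sleep` applies.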
Actions by Status
- Processing: Can be stopped (Stop)
- Paused: Can be resumed (Resume) or cancelled (Cancel)
- Completed: Can be reindexed (Reindex)
- Failed: Can view the error message and retry
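The list above can be expressed as a lookup table, for example to decide which buttons to enable. This is an illustrative sketch: the `Retry` label is assumed (the guide does not give a button name for retrying), and treating unlisted statuses as having no actions is also an assumption.

```python
# Actions per status, following the list above.
ALLOWED_ACTIONS = {
    "Processing": {"Stop"},
    "Paused": {"Resume", "Cancel"},
    "Completed": {"Reindex"},
    "Failed": {"Retry"},  # assumed label; the guide only says "retry"
}


def can_perform(status: str, action: str) -> bool:
    """True if the action is listed for the status; unlisted statuses allow nothing."""
    return action in ALLOWED_ACTIONS.get(status, set())
```

For example, `can_perform("Paused", "Resume")` is true, while `can_perform("Completed", "Stop")` is false.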
Document Detail Page
Click a completed document to navigate to the detail page at /knowledge/:knowledgeId/documents/files/:docId. On this page you can view:
- Indexing Status Card: Current processing status and progress information
- Chunk List: Browse generated chunks with pagination
- File Download: Download the original file
- Reindexing: Reindex with modified options
Precise mode (ACCURATE table extraction + OCR enabled) takes longer to process but produces more accurate results for PDFs with complex layouts. Default settings are sufficient for simple text documents.
When processing large files (50MB or more), set document_timeout higher than the default of 90 seconds (the maximum is 600). Parsing may fail if the timeout is exceeded.
Next Steps
- Manual Documents - Write and register text chunks directly
- Web Crawling - URL-based automatic web page collection
- Chunking and Options - Detailed guide on chunking strategies and advanced options
- Search Test - Verify search quality of indexed documents