Version: v0.1.0

File Upload

File Upload indexes document files in formats such as PDF and DOCX into a Knowledge. Uploaded documents are automatically converted into searchable knowledge data through parsing, chunking, and embedding.

Supported File Formats

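| Format | Extension | Max Size |
|---|---|---|
| PDF | .pdf | 100MB |
| Word | .docx, .doc | 100MB |
| PowerPoint | .pptx | 100MB |
| Excel | .xlsx | 100MB |
| CSV | .csv | 100MB |
| HTML | .html | 100MB |
| Text | .txt | 100MB |
| Markdown | .md | 100MB |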
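
The format and size limits above can be checked before a file is submitted. The following is a minimal sketch, not the product's actual validation logic: the function name `check_file` is hypothetical, and it assumes 100MB means 100 × 1024 × 1024 bytes and that extensions are matched case-insensitively.

```python
from pathlib import Path
from typing import Optional

# Extensions and size limit taken from the Supported File Formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".csv", ".html", ".txt", ".md",
}
# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def check_file(name: str, size_bytes: int) -> Optional[str]:
    """Return a rejection reason, or None if the file looks acceptable."""
    ext = Path(name).suffix.lower()  # assumption: case-insensitive match
    if ext not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {ext or '(no extension)'}"
    if size_bytes > MAX_SIZE_BYTES:
        return "file exceeds 100MB"
    return None
```

For example, `check_file("report.PDF", 1024)` returns `None`, while `check_file("archive.zip", 10)` returns a reason string.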

Upload Screen

The file upload screen is available at the /knowledge/:knowledgeId/documents/files/upload path. To open it, click the Upload Files button on the Documents tab of the Knowledge detail page.

The upload consists of a 3-step wizard:

| Step | Name | Description |
|---|---|---|
| Step 1 | Upload Files | File selection and parsing option configuration |
| Step 2 | Chunk Settings | Chunking strategy and parameter configuration |
| Step 3 | Review & Submit | Review indexing options and final submission |

Step 1: File Selection

Drag and drop files onto the drop zone or click to select files.

  • Multi-file upload: You can select or drag multiple files at once
  • File removal: Individual files can be removed from the added file list
  • Format filtering: Files with unsupported formats are automatically filtered out

At this step, configure Title Override, Description, and parsing options for the document.

Processing Options

Parsing Options (Step 1)

Parsing options control how text is extracted from documents.

| Option | Description | Default | Notes |
|---|---|---|---|
| Table Structure Extraction | Extract tables from documents in structured format | Enabled | Switch on/off |
| Table Extraction Mode | Table extraction accuracy setting | FAST | FAST or ACCURATE |
| Use OCR | Enable text recognition in images | Disabled | Switch on/off |
| OCR Engine | Select the OCR engine to use | AUTO | See table below |
| Document Timeout | Maximum wait time for parsing (seconds) | 90 | 10 to 600 |

OCR Engine Options

| Engine | Description |
|---|---|
| AUTO | System automatically selects the optimal engine |
| TESSERACT | Google Tesseract OCR |
| EASYOCR | EasyOCR (multi-language support) |
| RAPIDOCR | RapidOCR (lightweight high-speed engine) |
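
The parsing options and their allowed values can be modeled as a small validated settings object. This is only a sketch: apart from `document_timeout`, the field names are hypothetical and are not taken from the product's API. The defaults and ranges mirror the tables above.

```python
from dataclasses import dataclass

# Allowed values as listed in the parsing option tables.
TABLE_MODES = {"FAST", "ACCURATE"}
OCR_ENGINES = {"AUTO", "TESSERACT", "EASYOCR", "RAPIDOCR"}


@dataclass
class ParsingOptions:
    # Hypothetical field names; defaults follow the documented defaults.
    table_structure: bool = True
    table_mode: str = "FAST"
    use_ocr: bool = False
    ocr_engine: str = "AUTO"
    document_timeout: int = 90  # seconds

    def __post_init__(self) -> None:
        if self.table_mode not in TABLE_MODES:
            raise ValueError(f"table_mode must be one of {sorted(TABLE_MODES)}")
        if self.ocr_engine not in OCR_ENGINES:
            raise ValueError(f"ocr_engine must be one of {sorted(OCR_ENGINES)}")
        if not 10 <= self.document_timeout <= 600:
            raise ValueError("document_timeout must be between 10 and 600 seconds")
```

Validating at construction time means an out-of-range timeout or an unknown engine name is caught before any file is submitted.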

Chunking Options (Step 2)

Chunking options determine how extracted text is split into search units (chunks).

| Option | Description | Default | Range |
|---|---|---|---|
| Chunking Strategy | Text splitting algorithm | hybrid | See table below |
| Max Chunk Length | Maximum tokens per chunk | 500 | 100 to 4,000 |
| Overlap Length | Number of overlapping tokens between adjacent chunks | 50 | 0 to 500 |
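
To illustrate how Max Chunk Length and Overlap Length interact, here is a minimal sketch of size-based splitting with overlap. It is not the product's chunker: the function name is hypothetical, tokens are taken as an already tokenized list, and the defaults (500 and 50) reflect this sketch's reading of the table above.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of at most max_len, sharing `overlap` tokens."""
    if not 100 <= max_len <= 4000:
        raise ValueError("max_len must be between 100 and 4,000")
    if not 0 <= overlap <= 500:
        raise ValueError("overlap must be between 0 and 500")
    if overlap >= max_len:
        # Assumption: the overlap must be smaller than the chunk itself,
        # otherwise the window would never advance.
        raise ValueError("overlap must be smaller than max_len")

    step = max_len - overlap  # how far each new chunk advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

With 1,200 tokens and the defaults, each chunk starts 450 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.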

Chunking Strategies

| Strategy | Description | Recommended Use |
|---|---|---|
| Hybrid | Combined structure + size-based splitting | General purpose (default, recommended) |
| Markdown | Splits by markdown headings | .md files, structured documents |
| Hierarchical | Splits based on document hierarchy | Long documents with clear sections |
| Fixed Size | Uniform splitting by fixed size | Unstructured text |
| Parent-Child | Hierarchical splitting with parent-child relationships | Documents requiring detailed search |

Chunk Preview

In Step 2, the Chunk Preview panel shows the chunking results for the current settings in real time. Use it to verify chunk sizes and splitting quality before indexing.

Indexing Options (Step 3)

Indexing options configure how chunks are stored and which search mode is used.

| Option | Description | Default |
|---|---|---|
| Search Mode | Search method used for indexing | HYBRID |

Search Modes

| Mode | Description |
|---|---|
| Semantic Only | Stored in Vector DB only, semantic similarity search |
| Keyword Only | Stored in Text search DB only, BM25 keyword search |
| Hybrid (Recommended) | Stored in both Vector + Text, combining both search methods |

Search Mode and Storage

The search mode is linked to the Knowledge's storage target (configured in Settings). For example, if only Vector DB is enabled for the Knowledge, data will not be stored in Text DB even if Keyword Only mode is selected.
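
One way to picture this link is as an intersection between the stores a search mode asks for and the stores the Knowledge has enabled. The sketch below assumes that intersection rule; the source only states the Keyword Only example. The identifiers `SEMANTIC_ONLY`, `KEYWORD_ONLY`, `"vector"`, and `"text"` are hypothetical names for illustration.

```python
from typing import Set

# Stores each search mode writes to, per the Search Modes table.
# Mode and store identifiers are hypothetical.
MODE_TARGETS = {
    "SEMANTIC_ONLY": {"vector"},
    "KEYWORD_ONLY": {"text"},
    "HYBRID": {"vector", "text"},
}


def effective_stores(search_mode: str, enabled_stores: Set[str]) -> Set[str]:
    """Return the stores that would actually receive data.

    Assumption: data is written only to stores that are both requested by
    the search mode and enabled in the Knowledge settings.
    """
    return MODE_TARGETS[search_mode] & enabled_stores
```

Under this assumption, `effective_stores("KEYWORD_ONLY", {"vector"})` is empty, which matches the example above: nothing is stored in Text DB when only Vector DB is enabled.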

Post-Upload Status Monitoring

After submitting files, documents automatically go through the processing pipeline. You can check the current status of each document in the document list.

Document Status

| Status | Color | Description |
|---|---|---|
| Ready | Blue | Waiting for indexing |
| Processing | Blue (animated) | Parsing/chunking/indexing in progress |
| Completed | Green | Indexing complete |
| Paused | Yellow | Paused |
| Failed | Red | Processing failed |
| Cancelled | Gray | Cancelled by user |
| Reindexing | Blue (animated) | Reindexing in progress |

Documents in active states (Ready, Processing, Reindexing) are polled automatically every 5 seconds so that the list shows their latest status.
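
The polling behavior described above can be sketched as a loop that keeps checking while the document is in an active state. This is an illustration rather than the UI's implementation: `fetch_status` is a hypothetical callable supplied by the caller, and the `max_polls` cap is an added safeguard that the source does not mention.

```python
import time
from typing import Callable

ACTIVE_STATUSES = {"Ready", "Processing", "Reindexing"}
POLL_INTERVAL_SECONDS = 5


def wait_until_settled(
    fetch_status: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 120,  # assumption: give up after a bounded number of polls
) -> str:
    """Poll until the status leaves the active set, then return it."""
    status = fetch_status()
    polls = 1
    while status in ACTIVE_STATUSES and polls < max_polls:
        sleep(POLL_INTERVAL_SECONDS)
        status = fetch_status()
        polls += 1
    return status
```

Passing `sleep` as a parameter makes the loop easy to exercise without actually waiting, for example by recording the requested delays in a list.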

Actions by Status

  • Processing: Can be stopped (Stop)
  • Paused: Can be resumed (Resume) or cancelled (Cancel)
  • Completed: Can be reindexed (Reindex)
  • Failed: Can check error message and retry
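
The list above maps naturally onto a lookup table. In this sketch the action labels come from the list, while the function name and the choice to return no actions for the other statuses are assumptions; the source does not list actions for Ready, Cancelled, or Reindexing.

```python
from typing import Dict, List

# Actions per status, as listed above.
ACTIONS_BY_STATUS: Dict[str, List[str]] = {
    "Processing": ["Stop"],
    "Paused": ["Resume", "Cancel"],
    "Completed": ["Reindex"],
    "Failed": ["Retry"],  # the error message can also be viewed in this state
}


def available_actions(status: str) -> List[str]:
    """Return the actions for a status, or an empty list if none are documented."""
    return list(ACTIONS_BY_STATUS.get(status, []))
```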

Document Detail Page

Click a completed document to navigate to the detail page at /knowledge/:knowledgeId/documents/files/:docId. On this page you can view:

  • Indexing Status Card: Current processing status and progress information
  • Chunk List: Browse generated chunks with pagination
  • File Download: Download the original file
  • Reindexing: Reindex with modified options

Tip

Precise mode (ACCURATE table extraction + OCR enabled) takes longer to process but produces more accurate results for PDFs with complex layouts. Default settings are sufficient for simple text documents.

Large Files

When processing large files (50MB or more), set the document_timeout value higher than the default of 90 seconds (the maximum is 600). Parsing may fail if the timeout is exceeded.
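
As a rough illustration of this recommendation, the sketch below raises the timeout for files of 50MB or more and keeps it within the documented 10 to 600 second range. The function name and the 300-second value for large files are hypothetical choices for this example, not recommended settings from the source.

```python
DEFAULT_TIMEOUT = 90   # seconds, documented default
MAX_TIMEOUT = 600      # seconds, documented maximum
LARGE_FILE_BYTES = 50 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def pick_document_timeout(size_bytes: int, large_file_timeout: int = 300) -> int:
    """Return a document_timeout value for a file of the given size.

    large_file_timeout is an illustrative value, not a documented default.
    """
    if size_bytes < LARGE_FILE_BYTES:
        return DEFAULT_TIMEOUT
    # Never go below the default or above the documented maximum.
    return max(DEFAULT_TIMEOUT, min(MAX_TIMEOUT, large_file_timeout))
```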

Next Steps