Chunking and Options
Chunking is the process of splitting documents into searchable units (chunks). Selecting the right chunking strategy and options has a critical impact on search quality. This document provides a comprehensive guide to the chunking strategies, parameters, embedding models, per-store behavior, and indexing quality modes available in D.Hub Knowledge.
Chunking Strategies
Chunking strategies are the algorithms used to split documents into chunks. Selecting the appropriate strategy for a document's format and purpose can improve search quality.
| Strategy | Description | Recommended Use Case |
|---|---|---|
| hybrid | A combined strategy that merges markdown structure recognition with fixed-size splitting (default) | Versatile for most documents |
| markdown | Splits by markdown heading (#, ##, ###, etc.) at the section level | Structured documents (MD, HTML, technical docs) |
| hierarchical | Splits while preserving the document hierarchy (section → subsection → paragraph) | Long academic papers, legal documents, specifications |
| fixed | Splits uniformly by a specified token count. Ignores document structure | Unstructured text, logs, conversation records |
| parent_child | Creates pairs of large parent chunks and small child chunks. Search uses child chunks while context is provided from parent chunks | Cases requiring both detailed search and broad context |
If you're unsure which strategy to choose, use the default hybrid. It combines structure recognition with the stability of fixed-size splitting, providing good results in most situations.
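The strategy table above can be summarized as a small helper. This is an illustrative sketch only: the function `pick_strategy` and the option dictionary are hypothetical, and the field names (`chunking_strategy`, `max_tokens`, `overlap_length`) follow the parameter names in this document rather than any confirmed API schema.

```python
# Hypothetical chunking options, using the defaults described in this document.
chunking_options = {
    "chunking_strategy": "hybrid",  # default; safe choice for most documents
    "max_tokens": 500,              # maximum tokens per chunk (default)
    "overlap_length": 50,           # ~10% of max_tokens (default)
}

def pick_strategy(doc_kind: str) -> str:
    """Map a rough document kind to a recommended strategy (per the table above)."""
    return {
        "markdown": "markdown",        # structured MD/HTML/technical docs
        "paper": "hierarchical",       # long academic papers, legal documents
        "log": "fixed",                # unstructured text, logs, transcripts
        "qa": "parent_child",          # detailed search + broad context
    }.get(doc_kind, "hybrid")          # fall back to the default
```
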
Chunking Parameters
Parameters that control chunk size and overlap. These apply commonly to all chunking strategies.
Max Chunk Length (max_tokens)
The maximum number of tokens contained in a single chunk.
| Item | Value |
|---|---|
| Default | 500 |
| Range | 100–4,000 |
- When small: Chunks are finely split, enabling precise search, but may lack context
- When large: Chunks include more context, but search precision may decrease
Overlap Length (overlap_length)
The number of overlapping tokens between adjacent chunks. Setting overlap prevents context from being cut off at chunk boundaries.
| Item | Value |
|---|---|
| Default | 50 |
| Range | 0–500 |
Overlap should be set to 10–20% of the max chunk length. For example, if the max chunk length is 500, an overlap of 50–100 is appropriate. Too much overlap unnecessarily increases duplicate data, while too little overlap causes boundary context loss.
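The interaction between max chunk length and overlap is easiest to see in the `fixed` strategy, where both parameters apply directly. The sketch below splits a token sequence into uniform windows; real tokenization is model-specific, and the actual splitter implementation may differ.

```python
def chunk_fixed(tokens, max_tokens=500, overlap_length=50):
    """Sketch of 'fixed' splitting: uniform windows where each adjacent pair
    of chunks shares overlap_length tokens (document structure is ignored)."""
    if overlap_length >= max_tokens:
        raise ValueError("overlap_length must be smaller than max_tokens")
    step = max_tokens - overlap_length  # how far each new chunk advances
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

# 1,200 tokens with the default parameters -> 3 chunks of 500, 500, 300 tokens
chunks = chunk_fixed(list(range(1200)), max_tokens=500, overlap_length=50)
```

Because each chunk repeats the last 50 tokens of its predecessor, a sentence falling on a chunk boundary still appears intact in at least one chunk.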
Embedding Models
Embedding models convert text into high-dimensional vectors, enabling semantic similarity search. They are applied when using the VECTOR search method.
Available Models
The list of available embedding models is dynamically provided based on the server environment. You can check currently available models through the model dropdown in Settings when creating a Knowledge.
Examples of commonly provided models:
| Model | Dimensions | Characteristics |
|---|---|---|
| multilingual-e5-large | 1,024 | Multi-language support, high accuracy (default) |
| all-MiniLM-L6-v2 | 384 | Lightweight model, fast processing speed |
| all-MiniLM-L12-v2 | 384 | Slightly higher accuracy than L6 |
| all-mpnet-base-v2 | 768 | High quality for English text |
Actual available models may vary depending on the D.Hub instance configuration.
Higher vector dimensions can represent text meaning more finely, potentially improving search accuracy, but they also increase storage space and search computation costs. In typical usage environments, the default multilingual-e5-large (1,024 dimensions) offers the best balance of quality and performance.
Changing the embedding model makes it incompatible with existing indexed vectors. When changing models, you must perform a Settings Reset and reindex all documents.
Per-Store Behavior
Knowledge stores data in up to three stores depending on the selected search methods. Each store supports different search methods.
Vector (Vector DB)
| Item | Description |
|---|---|
| Stored Data | Embedding vectors of chunk text + original text + metadata |
| Search Method | Cosine similarity-based semantic search |
| Suitable Queries | Conceptual questions, synonym/similar expression search, natural language queries |
| Required Settings | Embedding model selection required |
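The cosine similarity measure named in the table above compares the angle between two embedding vectors, ignoring their magnitudes. A minimal sketch, using toy 4-dimensional vectors in place of real 384–1,024 dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: dot product
    divided by the product of their magnitudes, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# toy "embeddings" -- real models emit hundreds of dimensions
query = [0.1, 0.9, 0.0, 0.2]
doc   = [0.2, 0.8, 0.1, 0.1]
score = cosine_similarity(query, doc)  # near 1.0 -> semantically similar
```

The Vector store ranks chunks by this score against the query embedding, so paraphrases and synonyms can match even with no shared keywords.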
Text (Text Search)
| Item | Description |
|---|---|
| Stored Data | Chunk original text + Inverted Index + metadata |
| Search Method | BM25 algorithm-based keyword search |
| Suitable Queries | Proper nouns, code names, exact keywords, specific term searches |
| Required Settings | No additional settings required (no embedding needed) |
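The BM25 algorithm named above scores documents by term frequency, weighted by how rare each term is across the corpus and normalized by document length. A minimal sketch over pre-tokenized documents; the Text store's actual implementation and parameter values may differ.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Minimal BM25: score each tokenized doc (list of terms) for the query.
    k1 and b are the conventional defaults for the free parameters."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Rare terms (high IDF) dominate the score, which is why this store excels at proper nouns, code names, and exact keywords.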
Graph (Graph DB)
| Item | Description |
|---|---|
| Stored Data | Entities (nodes) and relationships (edges) extracted from chunks |
| Search Method | Graph traversal, path search, pattern matching |
| Suitable Queries | Entity relationship exploration, "What is the relationship between A and B?", connectivity analysis |
| Required Settings | No additional settings required (entities/relationships are automatically extracted) |
When both VECTOR and TEXT are selected, Hybrid search is available. It merges results from both stores using the Reciprocal Rank Fusion (RRF) algorithm, combining the advantages of semantic search and keyword search.
In most cases, the VECTOR + TEXT combination is most effective. Add GRAPH only when entity relationship exploration is needed.
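The RRF merge described above can be sketched in a few lines: each store contributes a ranked list, and every document earns 1/(k + rank) per list it appears in. The constant k=60 is the commonly used value from the RRF literature; the actual Hybrid search parameters may differ.

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked result lists (e.g. from the
    Vector and Text stores) into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c1", "c7"]  # semantic search ranking
text_hits   = ["c1", "c9", "c3"]  # BM25 keyword ranking
merged = rrf_merge([vector_hits, text_hits])
# "c1" and "c3" appear in both lists, so they rise to the top
```

Documents found by both stores accumulate score from both lists, so results that are strong semantically *and* lexically outrank results that only one store surfaced.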
Indexing Quality Modes
Options that control the trade-off between quality and speed during document indexing.
| Mode | Description | Processing Speed | Search Quality |
|---|---|---|---|
| High-Quality | Performs more precise chunking, embedding, and entity extraction | Slow | High |
| Economical | Simplifies processing steps for faster indexing | Fast | Moderate |
Mode Selection Guide
| Situation | Recommended Mode |
|---|---|
| Building important knowledge bases (production) | High-Quality |
| Quick testing and prototyping | Economical |
| Bulk indexing of large document sets | Economical (then reindex key documents later) |
| Small set of critical documents | High-Quality |
High-Quality mode takes longer for indexing compared to Economical, but produces more natural chunk boundaries and higher embedding quality, improving search result relevance. For production environments with fewer documents where search quality is important, High-Quality is recommended.
Interrelationships Between Options
Chunking, embedding, and storage options are interconnected and together determine the overall search pipeline quality.
```
Document → [Chunking Strategy + Parameters] → Chunks → [Embedding Model] → Vectors → [Storage]
                                                │                                        │
                                                └──────────── Original Text ────────────┘
```
| Stage | Related Options | Impact |
|---|---|---|
| Splitting | Chunking strategy, max_tokens, overlap_length | Determines chunk size and context scope |
| Vectorization | Embedding model | Determines semantic representation precision and multilingual performance |
| Storage | Search Methods | Determines available search modes |
| Quality | Indexing quality mode | Determines overall pipeline precision and speed |
Changes to chunking strategies or parameters are not applied retroactively to already indexed documents; they affect only documents added after the change. To apply new options to existing documents, reindex those documents individually or perform a full reset and re-collection from Settings.
Next Steps
- Settings — Manage search options and metadata
- File Upload — Apply chunking options during document file upload
- Web Crawling — Apply chunking options during web crawling
- Search Test — Verify search quality after option changes