Chunking and Options
Chunking is the process of splitting documents into searchable units (chunks). Selecting the right chunking strategy and options has a critical impact on search quality. This document provides a comprehensive guide to the chunking strategies, parameters, embedding models, per-store behavior, and indexing quality modes available in D.Hub Knowledge.
Chunking Strategies
Chunking strategies are the algorithms used to split documents into chunks. Selecting the appropriate strategy for a document's format and purpose can improve search quality.
| Strategy | Description | Recommended Use Case |
|---|---|---|
| hybrid | A combined strategy that merges markdown structure recognition with fixed-size splitting (default) | Versatile for most documents |
| markdown | Splits by markdown heading (#, ##, ###, etc.) at the section level | Structured documents (MD, HTML, technical docs) |
| hierarchical | Splits while preserving the document hierarchy (section → subsection → paragraph) | Long academic papers, legal documents, specifications |
| fixed | Splits uniformly by a specified token count. Ignores document structure | Unstructured text, logs, conversation records |
| parent_child | Creates pairs of large parent chunks and small child chunks. Search uses child chunks while context is provided from parent chunks | Cases requiring both detailed search and broad context |
If you're unsure which strategy to choose, use the default hybrid. It combines structure recognition with the stability of fixed-size splitting, providing good results in most situations.
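The strategy table above can be summarized as a small helper. This is an illustrative sketch only: the function `pick_strategy` and the option dictionary are hypothetical, and the field names (`chunking_strategy`, `max_tokens`, `overlap_length`) follow the parameter names in this document rather than any confirmed API schema.

```python
# Hypothetical chunking options, using the defaults described in this document.
chunking_options = {
    "chunking_strategy": "hybrid",  # default; safe choice for most documents
    "max_tokens": 500,              # maximum tokens per chunk (default)
    "overlap_length": 50,           # ~10% of max_tokens (default)
}

def pick_strategy(doc_kind: str) -> str:
    """Map a rough document kind to a recommended strategy (per the table above)."""
    return {
        "markdown": "markdown",        # structured MD/HTML/technical docs
        "paper": "hierarchical",       # long academic papers, legal documents
        "log": "fixed",                # unstructured text, logs, transcripts
        "qa": "parent_child",          # detailed search + broad context
    }.get(doc_kind, "hybrid")          # fall back to the default
```
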
Chunking Parameters
Parameters that control chunk size and overlap. These apply commonly to all chunking strategies.
Max Chunk Length (max_tokens)
The maximum number of tokens contained in a single chunk.
| Item | Value |
|---|---|
| Default | 500 |
| Range | 100–4,000 |
- When small: Chunks are finely split, enabling precise search, but may lack context
- When large: Chunks include more context, but search precision may decrease
Overlap Length (overlap_length)
The number of overlapping tokens between adjacent chunks. Setting overlap prevents context from being cut off at chunk boundaries.
| Item | Value |
|---|---|
| Default | 50 |
| Range | 0–500 |
Overlap should be set to 10–20% of the max chunk length. For example, if the max chunk length is 500, an overlap of 50–100 is appropriate. Too much overlap unnecessarily increases duplicate data, while too little overlap causes boundary context loss.
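The interaction between max chunk length and overlap is easiest to see in the `fixed` strategy, where both parameters apply directly. The sketch below splits a token sequence into uniform windows; real tokenization is model-specific, and the actual splitter implementation may differ.

```python
def chunk_fixed(tokens, max_tokens=500, overlap_length=50):
    """Sketch of 'fixed' splitting: uniform windows where each adjacent pair
    of chunks shares overlap_length tokens (document structure is ignored)."""
    if overlap_length >= max_tokens:
        raise ValueError("overlap_length must be smaller than max_tokens")
    step = max_tokens - overlap_length  # how far each new chunk advances
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

# 1,200 tokens with the default parameters -> 3 chunks of 500, 500, 300 tokens
chunks = chunk_fixed(list(range(1200)), max_tokens=500, overlap_length=50)
```

Because each chunk repeats the last 50 tokens of its predecessor, a sentence falling on a chunk boundary still appears intact in at least one chunk.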
Embedding Models
Embedding models convert text into high-dimensional vectors, enabling semantic similarity search. They are applied when using the VECTOR search method.
Available Models
The list of available embedding models is dynamically provided based on the server environment. You can check currently available models through the model dropdown in Settings when creating a Knowledge.
Examples of commonly provided models:
| Model | Dimensions | Characteristics |
|---|---|---|
| multilingual-e5-large | 1,024 | Multi-language support, high accuracy (default) |
| all-MiniLM-L6-v2 | 384 | Lightweight model, fast processing speed |
| all-MiniLM-L12-v2 | 384 | Slightly higher accuracy than L6 |
| all-mpnet-base-v2 | 768 | High quality for English text |
Actual available models may vary depending on the D.Hub instance configuration.
Higher vector dimensions can represent text meaning more finely, potentially improving search accuracy, but they also increase storage space and search computation costs. In typical usage environments, the default multilingual-e5-large (1,024 dimensions) offers the best balance of quality and performance.
Changing the embedding model makes it incompatible with existing indexed vectors. When changing models, you must perform a Settings Reset and reindex all documents.
Per-Store Behavior
Knowledge stores data in up to three stores depending on the selected search methods. Each store supports different search methods.
Vector (Vector DB)
| Item | Description |
|---|---|
| Stored Data | Embedding vectors of chunk text + original text + metadata |
| Search Method | Cosine similarity-based semantic search |
| Suitable Queries | Conceptual questions, synonym/similar expression search, natural language queries |
| Required Settings | Embedding model selection required |
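The cosine similarity measure named in the table above compares the angle between two embedding vectors, ignoring their magnitudes. A minimal sketch, using toy 4-dimensional vectors in place of real 384–1,024 dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: dot product
    divided by the product of their magnitudes, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# toy "embeddings" -- real models emit hundreds of dimensions
query = [0.1, 0.9, 0.0, 0.2]
doc   = [0.2, 0.8, 0.1, 0.1]
score = cosine_similarity(query, doc)  # near 1.0 -> semantically similar
```

The Vector store ranks chunks by this score against the query embedding, so paraphrases and synonyms can match even with no shared keywords.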
Text (Text Search)
| Item | Description |
|---|---|
| Stored Data | Chunk original text + Inverted Index + metadata |
| Search Method | BM25 algorithm-based keyword search |
| Suitable Queries | Proper nouns, code names, exact keywords, specific term searches |
| Required Settings | No additional settings required (no embedding needed) |
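The BM25 algorithm named above scores documents by term frequency, weighted by how rare each term is across the corpus and normalized by document length. A minimal sketch over pre-tokenized documents; the Text store's actual implementation and parameter values may differ.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Minimal BM25: score each tokenized doc (list of terms) for the query.
    k1 and b are the conventional defaults for the free parameters."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Rare terms (high IDF) dominate the score, which is why this store excels at proper nouns, code names, and exact keywords.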
Graph (Graph DB)
| Item | Description |
|---|---|
| Stored Data | Entities (nodes) and relationships (edges) extracted from chunks |
| Search Method | Graph traversal, path search, pattern matching |
| Suitable Queries | Entity relationship exploration, "What is the relationship between A and B?", connectivity analysis |
| Required Settings | No additional settings required (entities/relationships are automatically extracted) |
When both VECTOR and TEXT are selected, Hybrid search is available. It merges results from both stores using the Reciprocal Rank Fusion (RRF) algorithm, combining the advantages of semantic search and keyword search.
In most cases, the VECTOR + TEXT combination is most effective. Add GRAPH only when entity relationship exploration is needed.
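The RRF merge described above can be sketched in a few lines: each store contributes a ranked list, and every document earns 1/(k + rank) per list it appears in. The constant k=60 is the commonly used value from the RRF literature; the actual Hybrid search parameters may differ.

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked result lists (e.g. from the
    Vector and Text stores) into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c1", "c7"]  # semantic search ranking
text_hits   = ["c1", "c9", "c3"]  # BM25 keyword ranking
merged = rrf_merge([vector_hits, text_hits])
# "c1" and "c3" appear in both lists, so they rise to the top
```

Documents found by both stores accumulate score from both lists, so results that are strong semantically *and* lexically outrank results that only one store surfaced.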
Indexing Quality Modes
Options that control the trade-off between quality and speed during document indexing.
| Mode | Description | Processing Speed | Search Quality |
|---|---|---|---|
| High-Quality | Performs more precise chunking, embedding, and entity extraction | Slow | High |
| Economical | Simplifies processing steps for faster indexing | Fast | Moderate |
Mode Selection Guide
| Situation | Recommended Mode |
|---|---|
| Building important knowledge bases (production) | High-Quality |
| Quick testing and prototyping | Economical |
| Bulk indexing of large document sets | Economical (then reindex key documents later) |
| Small set of critical documents | High-Quality |
High-Quality mode takes longer for indexing compared to Economical, but produces more natural chunk boundaries and higher embedding quality, improving search result relevance. For production environments with fewer documents where search quality is important, High-Quality is recommended.
Interrelationships Between Options
Chunking, embedding, and storage options are interconnected and together determine the overall search pipeline quality.
```
Document → [Chunking Strategy + Parameters] → Chunks → [Embedding Model] → Vectors → [Storage]
                                                │                                        │
                                                └──────────── Original Text ────────────┘
```
| Stage | Related Options | Impact |
|---|---|---|
| Splitting | Chunking strategy, max_tokens, overlap_length | Determines chunk size and context scope |
| Vectorization | Embedding model | Determines semantic representation precision and multilingual performance |
| Storage | Search Methods | Determines available search modes |
| Quality | Indexing quality mode | Determines overall pipeline precision and speed |
Changes to chunking strategies or parameters are not applied retroactively to already indexed documents; they affect only documents added after the change. To apply new options to existing documents, reindex those documents individually or perform a full reset and re-collection from Settings.
Next Steps
- Settings — Manage search options and metadata
- File Upload — Apply chunking options during document file upload
- Web Crawling — Apply chunking options during web crawling
- Search Test — Verify search quality after option changes