Version: v0.1.0

Chunking and Options

Chunking is the process of splitting documents into searchable units (chunks). Selecting the right chunking strategy and options has a critical impact on search quality. This document provides a comprehensive guide to the chunking strategies, parameters, embedding models, per-store behavior, and indexing quality modes available in D.Hub Knowledge.


Chunking Strategies

Algorithms for splitting documents into chunks. Selecting the appropriate strategy based on the document format and purpose can improve search quality.

| Strategy | Description | Recommended Use Case |
| --- | --- | --- |
| hybrid | A combined strategy that merges markdown structure recognition with fixed-size splitting (default) | Versatile for most documents |
| markdown | Splits by markdown heading (#, ##, ###, etc.) at the section level | Structured documents (MD, HTML, technical docs) |
| hierarchical | Splits while preserving the document hierarchy (section → subsection → paragraph) | Long academic papers, legal documents, specifications |
| fixed | Splits uniformly by a specified token count; ignores document structure | Unstructured text, logs, conversation records |
| parent_child | Creates pairs of large parent chunks and small child chunks; search uses child chunks while context is provided from parent chunks | Cases requiring both detailed search and broad context |
Strategy Selection Guide

If you're unsure which strategy to choose, use the default hybrid. It combines structure recognition with the stability of fixed-size splitting, providing good results in most situations.
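To make the strategies concrete, here is a minimal sketch of the kind of section-level splitting the markdown strategy describes. This is illustrative only, not D.Hub's actual implementation:

```python
import re

def split_by_headings(text: str) -> list[str]:
    """Split markdown text into section chunks at heading lines (#, ##, ...)."""
    chunks, current = [], []
    for line in text.splitlines():
        # A heading line starts a new chunk (unless we are at the very top).
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\nOverview text.\n## Setup\nInstall steps.\n## Usage\nRun it."
print(split_by_headings(doc))  # three chunks, one per heading section
```

The hybrid strategy would additionally apply fixed-size splitting inside any section that exceeds the max chunk length.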


Chunking Parameters

Parameters that control chunk size and overlap. They apply to all chunking strategies.

Max Chunk Length (max_tokens)

The maximum number of tokens contained in a single chunk.

| Item | Value |
| --- | --- |
| Default | 500 |
| Range | 100 ~ 4,000 |
  • When small: Chunks are finely split, enabling precise search, but may lack context
  • When large: Chunks include more context, but search precision may decrease

Overlap Length (overlap_length)

The number of overlapping tokens between adjacent chunks. Setting overlap prevents context from being cut off at chunk boundaries.

| Item | Value |
| --- | --- |
| Default | 50 |
| Range | 0 ~ 500 |
Recommended Overlap Ratio

Overlap should be set to 10~20% of the max chunk length. For example, if the max chunk length is 500, an overlap of 50~100 is appropriate. Too much overlap unnecessarily increases duplicate data, while too little overlap causes boundary context loss.
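The interaction of max_tokens and overlap_length can be sketched as a sliding window over the token stream. The tokenization below (pre-split token list) is a simplification for illustration:

```python
def chunk_tokens(tokens: list[str], max_tokens: int = 500, overlap: int = 50) -> list[list[str]]:
    """Sliding-window split: each chunk repeats the last `overlap` tokens
    of the previous chunk so boundary context is preserved."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap  # the window advances by this many tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

# 50-token overlap on a 500-token window = 10%, within the recommended ratio.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, max_tokens=500, overlap=50)
print(len(chunks))       # 1,200 tokens -> 3 overlapping chunks
print(chunks[1][:3])     # second chunk starts 450 tokens in, repeating 50
```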


Embedding Models

Embedding models convert text into high-dimensional vectors, enabling semantic similarity search. They are applied when using the VECTOR search method.

Available Models

The list of available embedding models is dynamically provided based on the server environment. You can check currently available models through the model dropdown in Settings when creating a Knowledge.

Examples of commonly provided models:

| Model | Dimensions | Characteristics |
| --- | --- | --- |
| multilingual-e5-large | 1,024 | Multi-language support, high accuracy (default) |
| all-MiniLM-L6-v2 | 384 | Lightweight model, fast processing speed |
| all-MiniLM-L12-v2 | 384 | Slightly higher accuracy than L6 |
| all-mpnet-base-v2 | 768 | High quality for English text |
info

Actual available models may vary depending on the D.Hub instance configuration.

Dimensions and Search Quality

Higher vector dimensions can represent text meaning more finely, potentially improving search accuracy, but storage space and search computation costs increase accordingly. In typical usage environments, the default multilingual-e5-large (1,024 dimensions) offers the best balance of quality and performance.
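Semantic search over embeddings reduces to comparing vector directions. A minimal sketch of the cosine similarity used by the vector store (toy 4-dimensional vectors stand in for real 384- to 1,024-dimensional embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.2, 0.0]
doc_a = [0.1, 0.8, 0.3, 0.1]   # points in a similar direction (semantically close)
doc_b = [0.9, 0.0, 0.1, 0.7]   # points elsewhere (unrelated)
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```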

Caution When Changing Models

Changing the embedding model makes it incompatible with existing indexed vectors. When changing models, you must perform a Settings Reset and reindex all documents.


Per-Store Behavior

Knowledge stores data in up to three stores depending on the selected search methods. Each store supports different search methods.

Vector (Vector DB)

| Item | Description |
| --- | --- |
| Stored Data | Embedding vectors of chunk text + original text + metadata |
| Search Method | Cosine similarity-based semantic search |
| Suitable Queries | Conceptual questions, synonym/similar expression search, natural language queries |
| Required Settings | Embedding model selection required |
Text (Full-Text Index)

| Item | Description |
| --- | --- |
| Stored Data | Chunk original text + Inverted Index + metadata |
| Search Method | BM25 algorithm-based keyword search |
| Suitable Queries | Proper nouns, code names, exact keywords, specific term searches |
| Required Settings | No additional settings required (no embedding needed) |
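The BM25 ranking used by the text store weighs term frequency against document length and term rarity. A simplified single-document scorer, for illustration only (production inverted indexes precompute these statistics):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 score of one tokenized document for a query, given the whole
    corpus (list of token lists) for IDF and average document length."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc_terms)
    n = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # rarity weight
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score

corpus = [
    ["dhub", "knowledge", "chunking", "guide"],
    ["vector", "search", "embedding", "model"],
    ["bm25", "keyword", "search", "index"],
]
# The exact keyword "bm25" only occurs in the third document, which wins.
scores = [bm25_score(["bm25"], doc, corpus) for doc in corpus]
print(scores.index(max(scores)))  # 2
```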

Graph (Graph DB)

| Item | Description |
| --- | --- |
| Stored Data | Entities (nodes) and relationships (edges) extracted from chunks |
| Search Method | Graph traversal, path search, pattern matching |
| Suitable Queries | Entity relationship exploration, "What is the relationship between A and B?", connectivity analysis |
| Required Settings | No additional settings required (entities/relationships are automatically extracted) |
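A "What is the relationship between A and B?" query amounts to a path search over the extracted triples. A toy sketch with hand-written triples (in the real store these are extracted from chunks automatically):

```python
from collections import deque

# Toy knowledge graph: entities as nodes, labeled relationships as edges.
edges = [
    ("Alice", "works_at", "Acme"),
    ("Acme", "acquired", "Beta Corp"),
    ("Bob", "works_at", "Beta Corp"),
]

def relationship_path(start, goal, triples):
    """BFS over an undirected view of the graph; returns the chain of
    triples connecting start to goal, or None if they are unconnected."""
    adjacency = {}
    for s, rel, o in triples:
        adjacency.setdefault(s, []).append((o, (s, rel, o)))
        adjacency.setdefault(o, []).append((s, (s, rel, o)))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, triple in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [triple]))
    return None

# Alice and Bob are linked through Acme's acquisition of Beta Corp.
print(relationship_path("Alice", "Bob", edges))
```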

When both VECTOR and TEXT are selected, Hybrid search is available. It merges results from both stores using the Reciprocal Rank Fusion (RRF) algorithm, combining the advantages of semantic search and keyword search.
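Reciprocal Rank Fusion is a simple, well-defined merge: each result list contributes 1 / (k + rank) per document, and documents are reordered by the summed score. A minimal sketch (k = 60 is the value commonly used in the RRF literature; D.Hub's internal constant is not specified here):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: fuse several ranked lists into one.
    A document ranked highly in several lists accumulates the most score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c1", "c7"]   # semantic (VECTOR) ranking
text_hits   = ["c1", "c9", "c3"]   # keyword (TEXT) ranking
# c1 appears near the top of both lists, so it wins the fused ranking.
print(rrf_merge([vector_hits, text_hits]))
```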

tip

In most cases, the VECTOR + TEXT combination is most effective. Add GRAPH only when entity relationship exploration is needed.


Indexing Quality Modes

Options that control the trade-off between quality and speed during document indexing.

| Mode | Description | Processing Speed | Search Quality |
| --- | --- | --- | --- |
| High-Quality | Performs more precise chunking, embedding, and entity extraction | Slow | High |
| Economical | Simplifies processing steps for faster indexing | Fast | Moderate |

Mode Selection Guide

| Situation | Recommended Mode |
| --- | --- |
| Building important knowledge bases (production) | High-Quality |
| Quick testing and prototyping | Economical |
| Bulk indexing of large document sets | Economical (then reindex key documents later) |
| Small set of critical documents | High-Quality |
High-Quality vs Economical

High-Quality mode takes longer for indexing compared to Economical, but produces more natural chunk boundaries and higher embedding quality, improving search result relevance. For production environments with fewer documents where search quality is important, High-Quality is recommended.


Interrelationships Between Options

Chunking, embedding, and storage options are interconnected and together determine the overall search pipeline quality.

Document → [Chunking Strategy + Parameters] → Chunks → [Embedding Model] → Vectors → [Storage]
                                                │                                        ▲
                                                └────────────── Original Text ───────────┘
| Stage | Related Options | Impact |
| --- | --- | --- |
| Splitting | Chunking strategy, max_tokens, overlap_length | Determines chunk size and context scope |
| Vectorization | Embedding model | Determines semantic representation precision and multilingual performance |
| Storage | Search Methods | Determines available search modes |
| Quality | Indexing quality mode | Determines overall pipeline precision and speed |
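Tying the stages together, a hypothetical settings payload might look as follows. Field names here are illustrative only and do not reflect D.Hub's actual API schema:

```python
# Hypothetical Knowledge settings; every key name is an assumption for
# illustration, not the real D.Hub configuration schema.
knowledge_settings = {
    "chunking": {
        "strategy": "hybrid",        # splitting stage
        "max_tokens": 500,
        "overlap_length": 50,        # 10% of max_tokens, within the 10~20% guideline
    },
    "embedding_model": "multilingual-e5-large",  # vectorization stage
    "search_methods": ["VECTOR", "TEXT"],        # storage stage -> enables Hybrid search
    "indexing_quality": "high_quality",          # quality stage
}

# Sanity check: overlap stays within the recommended 10~20% band.
ratio = (knowledge_settings["chunking"]["overlap_length"]
         / knowledge_settings["chunking"]["max_tokens"])
assert 0.1 <= ratio <= 0.2
```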
Reindexing After Option Changes

Changes to chunking strategies or parameters are not retroactively applied to already indexed documents; changed options apply only to newly added documents. To apply changed options to existing documents, you must reindex those documents individually or perform a full reset and re-collection from Settings.


Next Steps

  • Settings — Manage search options and metadata
  • File Upload — Apply chunking options during document file upload
  • Web Crawling — Apply chunking options during web crawling
  • Search Test — Verify search quality after option changes