
Version: v0.1.0

Glossary

A glossary of key terms and technical concepts used in the D.Hub documentation. Terms are listed alphabetically within each section.

Platform Terms

Term	Description
Batch	A unit of work that processes large amounts of data at once. Progress can be tracked through checkpoints.
Checkpoint	A snapshot that records the progress state at a specific point during batch processing. If an error occurs, processing can resume from that point.
Chunk	A text fragment created by splitting documents collected in Knowledge into sizes suitable for search and embedding. Available chunking strategies include fixed-size, markdown-based, and hierarchical splitting.
Collection	The top-level container that logically groups related resources such as datasets, code, pipelines, and knowledge.
Collection Item	An individual resource (dataset, code, pipeline, or knowledge) included in a collection.
Dataset Version	A version representing the state of a dataset at a specific point in time. All change history is automatically recorded through Delta Lake's transaction log.
Embedding	The result of converting text into numerical representations in a high-dimensional vector space. Semantically similar texts have close vector values, enabling similarity-based search.
Manifest	A definition file containing metadata and configuration information for resources such as datasets and pipelines. Stored in JSON format on S3 storage.
Ontology Entity	A node in the ontology that represents a real-world entity. Defines domain objects such as users, products, and sensors, and can have properties.
Ontology Relationship	An edge that defines a semantic connection between ontology entities. Represents relationships such as "owns" or "located at".
Pipeline Node	An individual processing step within a pipeline. Performs tasks such as data reading, transformation, filtering, and storage, with data flow constructed through connections between nodes.
RAG	Short for Retrieval-Augmented Generation, a technique that retrieves external documents and supplies them to an LLM to ground the answers it generates. Used in Knowledge's AI Chat.
Vector DB	A specialized database that stores data in vector (number array) format and performs similarity-based search. In D.Hub, it is used for storing and searching embedded document chunks.
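
The Embedding and Vector DB entries above describe similarity-based search over vectors. The following is a minimal sketch of that idea, not D.Hub's implementation: it assumes cosine similarity as the metric (the glossary does not name one) and uses hand-written three-dimensional vectors in place of real embeddings. The names `cosine_similarity`, `top_k`, and the chunk IDs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Rank stored chunk vectors by similarity to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumption: real ones have hundreds of dimensions).
chunk_vecs = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, chunk_vecs, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

A production vector database typically replaces this linear scan with an approximate nearest-neighbor index, but the ranking principle is the same.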

AI-Related Terms

Term	Description
BM25	A traditional text search algorithm that calculates relevance based on keyword frequency within documents and document length. Used in Knowledge's text search.
Hybrid Search	A method that combines keyword-based search (BM25) and vector similarity search to improve search accuracy. Uses Reciprocal Rank Fusion (RRF) to merge the two search results.
LLM	Short for Large Language Model, a language model trained on large-scale text data. Representative examples include GPT and Claude. In D.Hub, LLMs are used for AI Chat and code generation.
Reranker	A post-processing step that re-ranks initial search results using an LLM or cross-encoder model to place documents most relevant to the question at the top.
Retriever	A module that searches for relevant documents in response to user queries. D.Hub supports text search, vector search, and hybrid search.
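
The Hybrid Search entry above mentions Reciprocal Rank Fusion (RRF) for merging keyword and vector results. As a rough sketch of how RRF works in general: each document receives 1 / (k + rank) from every ranked list it appears in, and the sums determine the final order. The constant `k = 60` is a commonly used default and an assumption here; the value D.Hub uses is not stated. The function name and document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) per list it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best match first.
bm25_ranking = ["doc-2", "doc-1", "doc-3"]
vector_ranking = ["doc-1", "doc-4", "doc-2"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because RRF works on ranks rather than raw scores, it does not need BM25 scores and vector similarities to be on the same scale.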

Next Steps

Core Concepts — Understand the relationships between D.Hub components
Quick Start — Experience D.Hub's core workflow in 5 minutes
API Overview — Check the REST API reference
