Version: v0.1.0

Glossary

A glossary of key terms and technical concepts used in the D.Hub documentation. Entries in each table are sorted alphabetically.


Platform Terms

| Term | Description |
| --- | --- |
| Batch | A unit of work that processes a large amount of data at once. Progress can be tracked through checkpoints. |
| Checkpoint | A snapshot that records progress at a specific point during batch processing. If an error occurs, processing can resume from that point. |
| Chunk | A text fragment created by splitting documents collected in Knowledge into sizes suitable for search and embedding. Available chunking strategies include fixed-size, markdown-based, and hierarchical splitting, among others. |
| Collection | The top-level container that logically groups related resources such as datasets, code, pipelines, and knowledge. |
| Collection Item | An individual resource (dataset, code, pipeline, or knowledge) included in a collection. |
| Dataset Version | A version representing the state of a dataset at a specific point in time. All change history is automatically recorded through Delta Lake's transaction log. |
| Embedding | The result of converting text into a numerical representation in a high-dimensional vector space. Semantically similar texts map to nearby vectors, which enables similarity-based search. |
| Manifest | A definition file containing metadata and configuration information for resources such as datasets and pipelines. Stored in JSON format on S3 storage. |
| Ontology Entity | A node in the ontology that represents a real-world entity. It defines domain objects such as users, products, and sensors, and can have properties. |
| Ontology Relationship | An edge that defines a semantic connection between ontology entities. It represents relationships such as "owns" or "located at". |
| Pipeline Node | An individual processing step within a pipeline. A node performs tasks such as reading, transforming, filtering, or storing data, and the data flow is built from connections between nodes. |
| RAG | Stands for Retrieval-Augmented Generation, a technique that retrieves external documents to augment LLM answer generation. Used in Knowledge's AI Chat. |
| Vector DB | A specialized database that stores data as vectors (arrays of numbers) and performs similarity-based search. In D.Hub, it stores and searches embedded document chunks. |
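
The Chunk, Embedding, and Vector DB entries describe one flow: split text into chunks, convert each chunk to a vector, then rank chunks by similarity to a query. The following is a minimal Python sketch of that flow, assuming fixed-size chunking by character count and cosine similarity as the distance measure. The `toy_embed` function (letter counts) is a hypothetical stand-in for a real embedding model, and none of these names come from D.Hub's API.

```python
import math
from collections import Counter


def chunk_fixed_size(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts of letters a-z."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, chunks: list[str], top_k: int = 1) -> list[tuple[float, str]]:
    """Return the top_k (score, chunk) pairs, most similar first."""
    q = toy_embed(query)
    scored = [(cosine_similarity(q, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A real vector database replaces the linear scan in `search` with an index built for nearest-neighbor lookup, but the ranking idea is the same.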

| Term | Description |
| --- | --- |
| BM25 | A traditional text search algorithm that calculates relevance from keyword frequency within documents and from document length. Used in Knowledge's text search. |
| Hybrid Search | A method that combines keyword-based search (BM25) and vector similarity search to improve search accuracy. It uses Reciprocal Rank Fusion (RRF) to merge the two result lists. |
| LLM | Stands for Large Language Model, a language model trained on large-scale text data. Representative examples include GPT and Claude. LLMs are used in D.Hub's AI Chat and code generation. |
| Reranker | A post-processing step that re-ranks initial search results using an LLM or cross-encoder model so that the documents most relevant to the question appear at the top. |
| Retriever | A module that searches for relevant documents in response to user queries. D.Hub supports text search, vector search, and hybrid search. |
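
The Hybrid Search entry mentions Reciprocal Rank Fusion. As a rough illustration, RRF gives each document a score of 1 / (k + rank) for every ranked list it appears in and sums those scores. The sketch below assumes k = 60, a commonly used constant; the value D.Hub uses is not stated here, and the function name and document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into one list, best first.

    Each document scores the sum of 1 / (k + rank) over the lists that
    contain it, with rank starting at 1. k = 60 is an assumed default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) search and a vector search.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF uses only rank positions, it can merge BM25 scores and vector similarities without normalizing them to a common scale.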

Next Steps

  • Core Concepts — Understand the relationships between D.Hub components
  • Quick Start — Experience D.Hub's core workflow in 5 minutes
  • API Overview — Check the REST API reference