| Batch | A unit of work that processes large amounts of data at once. Progress can be tracked through checkpoints. |
| Checkpoint | A snapshot that records the progress state at a specific point during batch processing. If an error occurs, processing can resume from that point. |
| Chunk | A text fragment created by splitting documents collected in Knowledge into sizes suitable for search and embedding. Available chunking strategies include fixed-size, Markdown-based, and hierarchical splitting. |
| Collection | The top-level container that logically groups related resources such as datasets, code, pipelines, and knowledge. |
| Collection Item | An individual resource (dataset, code, pipeline, or knowledge) included in a collection. |
| Dataset Version | A version representing the state of a dataset at a specific point in time. All change history is automatically recorded through Delta Lake's transaction log. |
| Embedding | The numerical representation of a text as a vector in a high-dimensional space. Semantically similar texts map to nearby vectors, which enables similarity-based search. |
| Manifest | A definition file containing metadata and configuration information for resources such as datasets and pipelines. Stored in JSON format on S3 storage. |
| Ontology Entity | A node in the ontology that represents a real-world entity. Defines domain objects such as users, products, and sensors, and can have properties. |
| Ontology Relationship | An edge that defines a semantic connection between ontology entities. Represents relationships such as "owns" or "located at". |
| Pipeline Node | An individual processing step within a pipeline. Performs tasks such as data reading, transformation, filtering, and storage, with data flow constructed through connections between nodes. |
| RAG | Retrieval-Augmented Generation. A technique that retrieves external documents and supplies them to an LLM to augment its answer generation. Used in Knowledge's AI Chat. |
| Vector DB | A specialized database that stores data as vectors (arrays of numbers) and performs similarity-based search. In D.Hub, it is used to store and search embedded document chunks. |
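
The Chunk, Embedding, and Vector DB entries above describe a flow that can be sketched in a few lines. This is only an illustration under stated assumptions, not D.Hub's implementation: the function names (`chunk_fixed`, `cosine_similarity`, `top_k`) are hypothetical, chunking is by character count, and the vectors are hand-written stand-ins for what an embedding model would produce.

```python
import math


def chunk_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks.

    Consecutive chunks share `overlap` characters. (Hypothetical helper;
    real chunkers often count tokens rather than characters.)
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int = 1) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query.

    `index` is a list of (chunk_text, vector) pairs, a toy stand-in for a
    vector database.
    """
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    print(chunk_fixed("abcdefghij", size=4))
    # Hand-written 2-dimensional vectors used purely for illustration.
    index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    print(top_k([1.0, 0.1], index, k=2))
```

A linear scan like `top_k` is fine for a handful of chunks; vector databases typically rely on approximate nearest-neighbor indexes to keep search fast at larger scale.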