Version Control
All resources in D.Hub (Dataset, Code, Pipeline, Knowledge) have their change history tracked as versions. This document explains D.Hub's version management system and how to use it.
Version Control Overview
D.Hub provides two layers of version management.
Manifest-Based Version Control
Metadata and configuration information for all resources are managed as Manifests. Manifests are stored on S3-compatible object storage, and change history is automatically recorded through object versioning.
- When a resource is created, an initial manifest is stored.
- When metadata (name, tags, description, etc.) is modified, a new version of the manifest is created.
- Previous manifest versions are retained and not deleted.
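Because manifests live on S3-compatible object storage with object versioning enabled, each manifest revision can be listed with standard S3 APIs. The sketch below uses boto3; the endpoint, bucket, and key are hypothetical placeholders, since D.Hub's actual storage layout is not documented here.

```python
import boto3

# Hypothetical endpoint/bucket/key; D.Hub manages the real manifest layout internally.
s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

versions = s3.list_object_versions(
    Bucket="dhub-manifests",
    Prefix="datasets/my-dataset/manifest.json",
)
for v in versions.get("Versions", []):
    # Each entry corresponds to one manifest revision; none are deleted.
    print(v["VersionId"], v["LastModified"], "latest" if v["IsLatest"] else "")
```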
Delta Lake Table Version Control
The actual data of datasets is managed as Delta Lake tables. Delta Lake tracks data versions through its own transaction log.
- New versions are created when CSV uploads, data additions, or schema changes occur.
- Each version is assigned a unique version number and timestamp.
- Time Travel: You can query past data by a specific point in time or version number.
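Time Travel is a standard Delta Lake capability, so it can be sketched with the `deltalake` Python package directly against the table. The table path below is a hypothetical placeholder; the real location is managed by D.Hub.

```python
from deltalake import DeltaTable

# Hypothetical table location; D.Hub manages the real path internally.
dt = DeltaTable("/data/dhub/datasets/my-dataset")

print(dt.version())               # current table version number
for commit in dt.history(3):      # last three transaction-log entries
    print(commit["timestamp"], commit["operation"])

dt.load_as_version(2)             # time travel: switch the table to version 2
older_df = dt.to_pandas()         # read the data as of that version
```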
Manifest versions and Delta Lake table versions are managed independently. Manifests track metadata changes of resources, while Delta Lake tracks actual data changes.
Version Control by Resource
Dataset Versions
Datasets have two levels of versioning.
| Version Type | Managed Target | Tracking Method |
|---|---|---|
| Manifest Version | Metadata such as name, schema, tags, description | S3 Object Versioning |
| Table Version | Actual table data (rows/columns) | Delta Lake Transaction Log |
When referencing a dataset as input in a pipeline, you can specify a particular version number to use data from an exact point in time.
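The exact reference syntax depends on the pipeline manifest format; purely as an illustration, a pinned input might look like the following (the field names are hypothetical, not D.Hub's actual schema).

```python
# Illustrative only; field names and structure are hypothetical.
pipeline_input = {
    "dataset": "my-dataset",
    "table_version": 7,   # pin the Delta Lake table version to read
}
```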
Code Versions
Each version of a code artifact is stored in object storage.
- A new version is automatically created when code content is modified.
- You can check the modification time for each version in the version list.
- Selecting two versions lets you compare changes in the Monaco DiffEditor (Side-by-side / Unified mode).
- You can download the code file for a specific version.
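At the storage level, downloading a particular code version amounts to fetching a specific object version. A minimal sketch with boto3, assuming a hypothetical bucket and key; the version ID comes from the version list.

```python
import boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

# Hypothetical bucket/key; replace the placeholder with a VersionId from the version list.
obj = s3.get_object(
    Bucket="dhub-code",
    Key="codes/my-script/main.py",
    VersionId="<version-id-from-the-list>",
)
with open("main_previous.py", "wb") as f:
    f.write(obj["Body"].read())
```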
Pipeline Versions
Pipeline configuration information (step list, dependencies, options, etc.) is managed as a manifest.
- New versions are created when steps are added/removed or dependencies change.
- When a pipeline runs, it uses the configuration as it exists at the time of execution.
Knowledge Versions
Knowledge resources are also versioned on a manifest basis. Actions such as adding documents or changing settings are recorded as versions.
Viewing the Version List
Click the Versions tab in a resource's detail screen to view the complete version list for that resource.
Each version entry displays the following information:
| Item | Description |
|---|---|
| Version ID | Unique version identifier |
| Latest | Whether it is the latest version |
| Last Modified | Timestamp when the version was created |
Viewing a Specific Version
Click an entry in the version list to view the resource state at that point in time.
- Dataset: Schema and data preview for that version
- Code: Code content for that version
- Pipeline: Step configuration for that version
Version Restore
You can restore a previous version as the current version. Select the version to restore from the version list and click the Restore button. A confirmation modal is displayed; after you confirm, that version's state is applied as the new latest version.
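The UI handles the restore for you; on versioned S3-compatible storage, restoring a manifest typically amounts to copying the selected object version back on top of its key, so it becomes the new latest version while the full history is preserved. A sketch with boto3, using hypothetical names:

```python
import boto3

s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

bucket, key = "dhub-manifests", "datasets/my-dataset/manifest.json"

# Copy the chosen older version over the current key; it becomes the new latest version.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key, "VersionId": "<version-id-to-restore>"},
)
```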
For datasets, you can select a specific version from the Delta Lake table version list to preview data at that point in time. This allows you to track data change history.
Checkpoint
A Checkpoint is a mechanism that records the execution state of each step during a pipeline's batch processing.
Checkpoint Structure
Each checkpoint contains the following information:
| Field | Description |
|---|---|
| State | Execution state (running, completed, failed, etc.) |
| Offsets | Data processing offsets (used to resume from where processing left off after a restart) |
| Start Time | Execution start time |
| End Time | Execution end time |
| Comment | Additional notes |
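As a purely illustrative example based on the fields in the table above (the concrete serialization format D.Hub uses is not specified here), a checkpoint record for one step might look like this:

```python
# Illustrative only; the actual checkpoint serialization is not documented here.
checkpoint = {
    "state": "completed",                    # running / completed / failed ...
    "offsets": {"rows_processed": 120_000},  # where to resume on restart
    "start_time": "2024-06-01T02:00:00Z",
    "end_time": "2024-06-01T02:14:37Z",
    "comment": "nightly batch",
}
```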
Checkpoint Usage
Checkpoints are useful in the following scenarios:
- Failure Recovery: When an error occurs during pipeline execution, you can restart from the last checkpoint (see the sketch below).
- Progress Tracking: Monitor progress through each step's processing offset.
- Execution History Analysis: Analyze past batch execution start/end times, throughput, etc.
Checkpoints are a batch processing-only feature. Checkpoints are not created for single (ad-hoc) pipeline executions.
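The failure-recovery scenario above can be sketched as follows: read the last checkpoint for a step and skip the records it already covered. The helper names are hypothetical, not D.Hub APIs.

```python
from typing import Iterable


def process(record: dict) -> None:
    """Placeholder for the step's real per-record work."""
    ...


def resume_step(records: Iterable[dict], checkpoint: dict) -> None:
    """Illustrative only: resume from the offset recorded in the last checkpoint."""
    start = checkpoint.get("offsets", {}).get("rows_processed", 0)
    for i, record in enumerate(records):
        if i < start:
            continue         # already processed before the failure
        process(record)
```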
Next Steps
- Datasets — Dataset schema and data management
- Codes — Code artifact management
- Pipelines (Collection) — Pipeline status review
- Running Pipelines — Batch execution and monitoring