Datasets
Datasets are the core resource for managing structured table data in D.Hub. They internally use the Delta Lake format to provide schema management, version tracking, and transaction support.
Dataset Overview
Datasets are one of the primary item types included in collections, with the following characteristics:
- Schema-Based Management: Clearly defines each column's name, data type, and description.
- Version Control: A new version is automatically recorded whenever data changes.
- Delta Lake-Based: Supports ACID transactions, Time Travel, and Schema Evolution.
- S3-Compatible Storage: Data is stored on object storage.
Dataset Types
D.Hub supports two types of datasets.
| Type | Description |
|---|---|
| Delta Dataset | General-purpose table dataset populated through file uploads and pipelines |
| Kafka Dataset | Streaming dataset that ingests data from real-time message streams |
Dataset Detail Screen
When you select a dataset in the collection tree, detailed information is displayed in the right panel.
Schema Tab
The Schema tab lets you view and edit the dataset's structure.
- Column List: Each column's name, data type, and description are displayed in table format.
- Type Information: Precise Arrow-based data types are shown (e.g., int64, string, timestamp).
- Nullable: Check whether each column allows NULL values.
- Schema Editing: In edit mode, you can modify column names, types, and descriptions.
The schema is automatically inferred during CSV upload and can be manually modified if needed.
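The column attributes listed above (name, data type, nullability, description) can be pictured as a small record per column. The following is a minimal sketch in plain Python, not D.Hub's internal representation; `ColumnSpec` and `find_null_violations` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in for one row of the Schema tab."""
    name: str
    dtype: str            # Arrow-style type name, e.g. "int64", "string"
    nullable: bool = True
    description: str = ""


def find_null_violations(schema, rows):
    """Return (row_index, column_name) pairs where a non-nullable column is None."""
    required = [col.name for col in schema if not col.nullable]
    return [
        (index, name)
        for index, row in enumerate(rows)
        for name in required
        if row.get(name) is None
    ]


schema = [
    ColumnSpec("id", "int64", nullable=False, description="Primary key"),
    ColumnSpec("city", "string"),
]
rows = [{"id": 1, "city": None}, {"id": None, "city": "Seoul"}]
print(find_null_violations(schema, rows))  # [(1, 'id')]
```

Only the second row is reported: `city` is nullable, so its missing value in the first row is allowed.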
Data Tab
The Data tab lets you view and analyze the actual stored data in table format.
- Pagination: Browse large volumes of data page by page.
- SQL Scratch Pad: A collapsible built-in SQL editor allows you to write queries directly to search and filter data.
- Map View: When geographic (latitude/longitude) data is present, a map visualization view is automatically activated.
- Chart View: When time series data is detected, a chart visualization view is automatically activated.
Versions Tab
View the dataset's version history in a timeline format.
- Version List: Displays each version's modification time and whether it's the latest version.
- Version Preview: Select a specific version to view data at that point in time.
- Version Restore: Restore a previous version as the current version.
Metadata
View and edit the dataset's metadata at the top of the detail screen.
| Field | Description |
|---|---|
| Name | Unique identifier for the dataset (lowercase letters, numbers, underscores) |
| Alias | Display name shown to users |
| Category | Resource classification (default: dataset) |
| Type | Dataset type (user-defined) |
| Tags | Tag list for search and classification |
| Comment | Description of the dataset |
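The Name rule in the table can be checked before submitting a dataset. This is a minimal sketch that assumes the rule is exactly "one or more lowercase letters, digits, or underscores"; any further constraints D.Hub may apply (length limits, leading-character rules) are not covered, and `is_valid_dataset_name` is a hypothetical helper.

```python
import re

# Assumed pattern derived from the description
# "lowercase letters, numbers, underscores"; not taken from D.Hub itself.
NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the whole name consists of allowed characters."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_dataset_name("sales_2025"))   # True
print(is_valid_dataset_name("Sales-2025"))   # False
```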
CSV Upload
D.Hub supports convenient data upload via CSV files.
Upload Procedure
- Select File: Click the Upload button on the dataset detail screen and select a CSV file.
- Automatic Schema Inference: The system analyzes the CSV file's headers and data to automatically infer each column's data type.
- Data Type Review: Review the inferred schema and modify types if needed.
- Execute Upload: After confirming, click the Upload button and the data is converted and stored as a Delta Lake table.
The first row of the CSV file should contain the column headers. If a file has no header row, its columns are automatically assigned default names such as column_0 and column_1 during upload.
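To make the inference step more concrete, here is a simplified sketch of how column types could be guessed from CSV content. It is not D.Hub's actual algorithm: the candidate types, their order, and the `has_header` flag are assumptions chosen for illustration, and only the column_0, column_1 naming follows the note above.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Guess one type name for a column from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"

    def all_parse(parser):
        # True when every value is accepted by the given parser.
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int64"
    if all_parse(float):
        return "float64"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "boolean"
    if all_parse(datetime.fromisoformat):
        return "timestamp"
    return "string"


def infer_schema(csv_text, has_header=True):
    """Return a list of (column_name, type_name) pairs."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # Headerless files get default names: column_0, column_1, ...
        names = [f"column_{i}" for i in range(len(rows[0]))]
        data = rows
    schema = []
    for i, name in enumerate(names):
        values = [row[i] for row in data if i < len(row)]
        schema.append((name, infer_type(values)))
    return schema


sample = "id,price,active,created,city\n1,9.5,true,2025-01-15,Seoul\n2,12,false,2025-01-16,Busan\n"
for name, type_name in infer_schema(sample):
    print(name, type_name)
```

A guess like this can be wrong (for example, postal codes that look like integers), which is why the Data Type Review step exists.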
Upload Notes
- File Encoding: UTF-8 encoding is recommended.
- Delimiter: The default delimiter is a comma (,); the delimiter is auto-detected based on file content.
- Data Size: Upload time may increase for large files.
Closing the browser or navigating away during upload may interrupt the upload. Perform other tasks only after confirming the upload completion message.
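Before uploading, a file can be checked locally for encoding and delimiter. The sketch below shows one simple way to do this and is not how D.Hub performs auto-detection: the candidate delimiter list is an assumption, and the counting heuristic ignores quoted fields.

```python
CANDIDATES = [",", ";", "\t", "|"]  # assumed candidates, not a D.Hub list


def decode_utf8(raw: bytes) -> str:
    """Decode as UTF-8, dropping a leading byte order mark if present.

    Raises UnicodeDecodeError when the file is not valid UTF-8.
    """
    return raw.decode("utf-8-sig")


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Pick the candidate that appears the same nonzero number of times on every line."""
    lines = [line for line in sample.splitlines() if line]
    if not lines:
        return default
    best, best_count = default, 0
    for delim in CANDIDATES:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1:  # same count on every line
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best


text = decode_utf8(b"\xef\xbb\xbfname;city\nKim;Seoul\n")
print(repr(detect_delimiter(text)))  # ';'
```

When no candidate is consistent across lines, the function falls back to the comma default.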
Supported Data Types
D.Hub datasets support a wide range of Apache Arrow-based data types.
Basic Types
| Type | Description | Example |
|---|---|---|
| string | String (UTF-8) | "Hello", "Seoul" |
| large_string | Large string | Long text data |
| boolean | True/False | true, false |
Integer Types
| Type | Range | Description |
|---|---|---|
| int8 | -128 to 127 | 8-bit integer |
| int16 | -32,768 to 32,767 | 16-bit integer |
| int32 | -2^31 to 2^31-1 | 32-bit integer |
| int64 | -2^63 to 2^63-1 | 64-bit integer |
| uint8 | 0 to 255 | Unsigned 8-bit integer |
| uint16 | 0 to 65,535 | Unsigned 16-bit integer |
| uint32 | 0 to 2^32-1 | Unsigned 32-bit integer |
| uint64 | 0 to 2^64-1 | Unsigned 64-bit integer |
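The ranges in the table follow directly from the bit width: a signed n-bit integer spans -2^(n-1) to 2^(n-1)-1, and an unsigned one spans 0 to 2^n-1. The sketch below computes these ranges and picks the narrowest type that holds a set of values, which can help when reviewing an inferred schema. `int_range` and `narrowest_int_type` are hypothetical helpers, not part of D.Hub.

```python
def int_range(bits: int, signed: bool = True):
    """Return (min, max) for an integer type of the given bit width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def narrowest_int_type(values, signed: bool = True):
    """Return the smallest type name that fits all values, or None if none fits."""
    lo, hi = min(values), max(values)
    prefix = "int" if signed else "uint"
    for bits in (8, 16, 32, 64):
        type_min, type_max = int_range(bits, signed)
        if type_min <= lo and hi <= type_max:
            return f"{prefix}{bits}"
    return None


print(int_range(8))                               # (-128, 127)
print(narrowest_int_type([0, 200]))               # int16
print(narrowest_int_type([0, 200], signed=False)) # uint8
```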
Floating-Point Types
| Type | Precision | Description |
|---|---|---|
| float16 | Half precision | 16-bit floating point |
| float32 | Single precision | 32-bit floating point |
| float64 | Double precision | 64-bit floating point |
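The practical difference between the three precisions shows up as rounding error. As a rough illustration, the sketch below uses Python's standard `struct` module to round-trip a value through each IEEE 754 width; `round_trip` is a hypothetical helper and does not involve D.Hub or Arrow.

```python
import struct

# struct format codes for IEEE 754 half, single, and double precision
FORMATS = {"float16": "<e", "float32": "<f", "float64": "<d"}


def round_trip(value: float, type_name: str) -> float:
    """Store the value at the given width and read it back."""
    fmt = FORMATS[type_name]
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


for type_name in FORMATS:
    stored = round_trip(0.1, type_name)
    print(type_name, stored, "error:", abs(stored - 0.1))
```

The error shrinks as the width grows, and float64 returns the Python float unchanged because Python floats are already double precision.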
Date/Time Types
| Type | Description | Example |
|---|---|---|
| date32 | Date (day unit) | 2025-01-15 |
| date64 | Date (millisecond unit) | 2025-01-15 |
| time32 | Time (seconds/milliseconds) | 14:30:00 |
| time64 | Time (microseconds/nanoseconds) | 14:30:00.123456 |
| timestamp | Timestamp (timezone support) | 2025-01-15T14:30:00+09:00 |
| duration | Duration | Elapsed time |
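In Arrow, date32 stores the number of days since 1970-01-01, date64 stores milliseconds since the same epoch, and a timestamp stores a count of time units since the epoch along with an optional time zone. The sketch below reproduces those encodings for the example values in the table using only the standard library. The helper names are hypothetical, and treating a timestamp without an offset as UTC is an assumption of this example.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)
MS_PER_DAY = 86_400_000


def to_date32(d: date) -> int:
    """Days since the UNIX epoch."""
    return (d - EPOCH).days


def to_date64(d: date) -> int:
    """Milliseconds since the UNIX epoch, at midnight of the given day."""
    return to_date32(d) * MS_PER_DAY


def to_timestamp_seconds(iso_text: str) -> int:
    """Seconds since the UNIX epoch for an ISO 8601 string."""
    dt = datetime.fromisoformat(iso_text)
    if dt.tzinfo is None:
        # Assumption for this sketch: naive values are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


print(to_date32(date(2025, 1, 15)))                       # 20103
print(to_date64(date(2025, 1, 15)))                       # 1736899200000
print(to_timestamp_seconds("2025-01-15T14:30:00+09:00"))  # 1736919000
```

The same calendar date therefore has two different stored values depending on whether the column is date32 or date64.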
Complex Types
| Type | Description |
|---|---|
| list | Array of elements of the same type |
| struct | Structure with multiple fields |
| binary | Binary data |
| large_binary | Large binary data |
Next Steps
- Code Management — Managing code artifacts in collections
- Version Control — Resource version management system
- Adding Items — Adding datasets to collections