Version: v0.1.0

Datasets

Datasets are the core resource for managing structured table data in D.Hub. Internally, they are stored in the Delta Lake format, which provides schema management, version tracking, and transaction support.

Dataset Overview

Datasets are one of the primary item types included in collections, with the following characteristics:

Schema-Based Management: Clearly defines each column's name, data type, and description.
Version Control: A new version is automatically recorded whenever data changes.
Delta Lake-Based: Supports ACID transactions, Time Travel, and Schema Evolution.
S3-Compatible Storage: Data is stored on object storage.

Dataset Types

D.Hub supports two types of datasets.

Type	Description
Delta Dataset	General table dataset based on file upload and pipelines
Kafka Dataset	Streaming dataset that ingests data from real-time message streams

Dataset Detail Screen

When you select a dataset in the collection tree, detailed information is displayed in the right panel.

Schema Tab

The Schema tab lets you view and edit the dataset's structure.

Column List: Each column's name, data type, and description are displayed in table format.
Type Information: Precise Arrow-based data types are shown (e.g., int64, string, timestamp).
Nullable: Shows whether each column allows NULL values.
Schema Editing: In edit mode, you can modify column names, types, and descriptions.
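
As an illustration of the information the Schema tab displays, the sketch below models a schema as plain Python data. The `Column` class, the `required_columns` helper, and the sample column names are hypothetical and are not part of any D.Hub API; only the type names (`int64`, `string`, `timestamp`) come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One row of the Schema tab's column list (hypothetical model)."""
    name: str
    dtype: str            # Arrow-style type name, e.g. "int64"
    nullable: bool = True
    description: str = ""


# Hypothetical example schema; the column names are made up for illustration.
schema = [
    Column("station_id", "int64", nullable=False, description="Station identifier"),
    Column("station_name", "string", description="Human-readable name"),
    Column("measured_at", "timestamp", nullable=False, description="Measurement time"),
]


def required_columns(columns):
    """Return the names of columns that do not allow NULL values."""
    return [c.name for c in columns if not c.nullable]


print(required_columns(schema))
```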

info

The schema is automatically inferred during CSV upload and can be manually modified if needed.

Data Tab

The Data tab lets you view and analyze the actual stored data in table format.

Pagination: Browse large volumes of data page by page.
SQL Scratch Pad: A collapsible built-in SQL editor allows you to write queries directly to search and filter data.
Map View: When geographic (latitude/longitude) data is present, a map visualization view is automatically activated.
Chart View: When time series data is detected, a chart visualization view is automatically activated.
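
The kind of search-and-filter query you might write in the SQL Scratch Pad can be tried locally. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in. The SQL dialect D.Hub accepts, the table name `stations`, and the columns are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a dataset (assumption: D.Hub's
# actual query engine and table naming may differ).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stations (name TEXT, latitude REAL, longitude REAL)")
conn.executemany(
    "INSERT INTO stations VALUES (?, ?, ?)",
    [
        ("Seoul", 37.5665, 126.9780),
        ("Busan", 35.1796, 129.0756),
        ("Jeju", 33.4996, 126.5312),
    ],
)

# Filter rows and sort the result, as one would in a scratch-pad query.
rows = conn.execute(
    "SELECT name, latitude FROM stations WHERE latitude > ? ORDER BY latitude DESC",
    (35.0,),
).fetchall()

for name, latitude in rows:
    print(name, latitude)
```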

Versions Tab

View the dataset's version history in a timeline format.

Version List: Displays each version's modification time and whether it's the latest version.
Version Preview: Select a specific version to view data at that point in time.
Version Restore: Restore a previous version as the current version.
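
To make preview and restore concrete, here is a toy in-memory model of a version history. It assumes that restoring appends a new version whose contents match the old one, which is how Delta Lake's own restore operation behaves; whether D.Hub's Version Restore works exactly this way is an assumption. The `VersionedTable` class is hypothetical.

```python
class VersionedTable:
    """Toy append-only version history (hypothetical, not a D.Hub API)."""

    def __init__(self):
        self._versions = []  # index in this list == version number

    def write(self, rows):
        """Record a new version holding a snapshot of `rows`; return its number."""
        self._versions.append(list(rows))
        return len(self._versions) - 1

    def read(self, version=None):
        """Return rows at `version`, or at the latest version when omitted."""
        if not self._versions:
            raise ValueError("table has no versions yet")
        if version is None:
            version = len(self._versions) - 1
        return list(self._versions[version])

    def restore(self, version):
        """Make an old version current by appending a copy of it."""
        return self.write(self.read(version))


table = VersionedTable()
v0 = table.write([{"id": 1}])
v1 = table.write([{"id": 1}, {"id": 2}])
v2 = table.restore(v0)  # history is kept, so v1 stays readable

print(v2, table.read(), table.read(v1))
```

In this model nothing is ever overwritten, so a restore can itself be undone by restoring the version that preceded it.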

Metadata

View and edit the dataset's metadata at the top of the detail screen.

Field	Description
Name	Unique identifier for the dataset (lowercase letters, numbers, underscores)
Alias	Display name shown to users
Category	Resource classification (default: `dataset`)
Type	Dataset type (user-defined)
Tags	Tag list for search and classification
Comment	Description of the dataset
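
The naming rule for the Name field can be checked before creating a dataset. This is a minimal sketch, assuming the rule means "one or more lowercase ASCII letters, digits, or underscores"; length limits and leading-character restrictions are not documented here, and `is_valid_dataset_name` is a hypothetical helper.

```python
import re

# Assumption: a name is one or more lowercase ASCII letters, digits, or
# underscores. Length limits and leading-character rules are not documented.
_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` uses only the documented character set."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["air_quality_2025", "AirQuality", "air-quality", ""]:
    print(repr(candidate), is_valid_dataset_name(candidate))
```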

CSV Upload

You can load data into a D.Hub dataset by uploading a CSV file.

Upload Procedure

Select File: Click the Upload button on the dataset detail screen and select a CSV file.
Automatic Schema Inference: The system analyzes the CSV file's headers and data to automatically infer each column's data type.
Data Type Review: Review the inferred schema and modify types if needed.
Execute Upload: After confirming the schema, click the Upload button. The data is converted and stored as a Delta Lake table.
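
The schema inference step in this procedure can be sketched as follows. This is not D.Hub's actual inference logic: the candidate types, the order in which they are tried, and the helper names are all assumptions. It only shows the general idea of picking, for each column, the first type that every non-empty value fits.

```python
import csv
import io
from datetime import datetime


def _parses_with(convert):
    """Build a predicate that is True when convert(text) succeeds."""
    def predicate(text):
        try:
            convert(text)
            return True
        except ValueError:
            return False
    return predicate


def _is_boolean(text):
    return text.lower() in ("true", "false")


# Candidate types, tried in this order; the first that fits wins (assumed order).
CANDIDATES = [
    ("int64", _parses_with(int)),
    ("float64", _parses_with(float)),
    ("boolean", _is_boolean),
    ("timestamp", _parses_with(datetime.fromisoformat)),
]


def infer_type(values):
    """Return the first candidate type that fits every non-empty value."""
    present = [v for v in values if v != ""]
    if present:
        for type_name, fits in CANDIDATES:
            if all(fits(v) for v in present):
                return type_name
    return "string"  # fallback for text and for all-empty columns


def infer_schema(csv_text):
    """Map each header name to an inferred type name."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    return {
        name: infer_type([row[i] if i < len(row) else "" for row in rows])
        for i, name in enumerate(header)
    }


SAMPLE = (
    "id,city,temperature,active,measured_at\n"
    "1,Seoul,3.5,true,2025-01-15T14:30:00+09:00\n"
    "2,Busan,7,false,2025-01-15T15:00:00+09:00\n"
)
print(infer_schema(SAMPLE))
```

Because inference only sees the values in the file, a column such as a postal code may be inferred as `int64` when `string` is intended, which is why the review step matters.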

tip

The first row of the CSV file should contain the column headers. If a file has no header row, its columns are automatically assigned default names such as column_0, column_1 during upload.

Upload Notes

File Encoding: UTF-8 encoding is recommended.
Delimiter: The delimiter is auto-detected from the file content, with comma (,) as the default.
Data Size: Upload time may increase for large files.
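
Encoding and delimiter can be checked locally before uploading. The sketch below uses Python's standard `csv.Sniffer` to guess the delimiter and falls back to a comma when no guess is possible. It is not D.Hub's detection logic; the candidate delimiter set and the `check_csv_bytes` helper are assumptions.

```python
import csv


def check_csv_bytes(raw: bytes, fallback_delimiter: str = ","):
    """Return (is_utf8, delimiter) for the full contents of a CSV file."""
    try:
        text = raw.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails;
        # the decoded text is used only for delimiter sniffing.
        text = raw.decode("latin-1")
        is_utf8 = False
    try:
        # Assumed candidate set: comma, semicolon, tab, pipe.
        delimiter = csv.Sniffer().sniff(text, delimiters=",;\t|").delimiter
    except csv.Error:
        delimiter = fallback_delimiter
    return is_utf8, delimiter


print(check_csv_bytes(b"a;b;c\n1;2;3\n4;5;6\n"))
print(check_csv_bytes(b"name,value\ncaf\xe9,1\nth\xe9,2\n"))
```

A file that fails the UTF-8 check is best re-saved as UTF-8 before upload.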

warning

Closing the browser or navigating away during an upload may interrupt it. Wait for the upload completion message before moving on to other tasks.

Supported Data Types

D.Hub datasets support the Apache Arrow-based data types listed below.

Basic Types

Type	Description	Example
`string`	String (UTF-8)	`"Hello"`, `"Seoul"`
`large_string`	Large string	Long text data
`boolean`	True/False	`true`, `false`

Integer Types

Type	Range	Description
`int8`	-128 to 127	8-bit integer
`int16`	-32,768 to 32,767	16-bit integer
`int32`	-2^31 to 2^31-1	32-bit integer
`int64`	-2^63 to 2^63-1	64-bit integer
`uint8`	0 to 255	Unsigned 8-bit integer
`uint16`	0 to 65,535	Unsigned 16-bit integer
`uint32`	0 to 2^32-1	Unsigned 32-bit integer
`uint64`	0 to 2^64-1	Unsigned 64-bit integer
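
The ranges in this table follow directly from the bit width and signedness. As a sketch, the helper below recomputes them and picks the narrowest listed type that can hold a column's minimum and maximum values. `int_range` and `smallest_integer_type` are hypothetical helpers, not D.Hub functions.

```python
# (type name, bit width, signed), ordered from narrowest to widest.
INTEGER_TYPES = [
    ("int8", 8, True), ("uint8", 8, False),
    ("int16", 16, True), ("uint16", 16, False),
    ("int32", 32, True), ("uint32", 32, False),
    ("int64", 64, True), ("uint64", 64, False),
]


def int_range(bits: int, signed: bool):
    """Return (min, max) for an integer type of the given width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def smallest_integer_type(lowest: int, highest: int):
    """Return the narrowest type holding [lowest, highest], or None."""
    for name, bits, signed in INTEGER_TYPES:
        type_min, type_max = int_range(bits, signed)
        if type_min <= lowest and highest <= type_max:
            return name
    return None


print(int_range(16, True))
print(smallest_integer_type(0, 200))
print(smallest_integer_type(-1, 200))
```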

Floating-Point Types

Type	Precision	Description
`float16`	Half precision	16-bit floating point
`float32`	Single precision	32-bit floating point
`float64`	Double precision	64-bit floating point
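
The practical difference between the three widths is how closely a stored value matches the original. The sketch below round-trips 0.1 through each IEEE 754 format using Python's standard `struct` module, as an illustration of precision loss rather than a description of how D.Hub stores values. `round_trip` is a hypothetical helper.

```python
import struct

# struct format codes for little-endian half, single, and double precision.
FORMATS = {"float16": "<e", "float32": "<f", "float64": "<d"}


def round_trip(value: float, type_name: str) -> float:
    """Store `value` in the given format and read it back."""
    fmt = FORMATS[type_name]
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


for type_name, fmt in FORMATS.items():
    stored = round_trip(0.1, type_name)
    bits = struct.calcsize(fmt) * 8
    print(type_name, bits, "bits", repr(stored), "error", abs(stored - 0.1))
```

Python's own `float` is double precision, so the `float64` round trip is exact, while the narrower formats return a nearby value.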

Date/Time Types

Type	Description	Example
`date32`	Date (day unit)	`2025-01-15`
`date64`	Date (millisecond unit)	`2025-01-15`
`time32`	Time (seconds/milliseconds)	`14:30:00`
`time64`	Time (microseconds/nanoseconds)	`14:30:00.123456`
`timestamp`	Timestamp (timezone support)	`2025-01-15T14:30:00+09:00`
`duration`	Duration	Elapsed time
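
The "day unit" and "millisecond unit" in this table refer to how Arrow encodes dates: `date32` counts days since 1970-01-01 and `date64` counts milliseconds since the same epoch. The sketch below computes both for the example date, plus a timestamp as microseconds since the epoch in UTC. The microsecond unit is an assumption, since the unit D.Hub uses is not stated here, and the helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
MS_PER_DAY = 86_400_000


def to_date32(d: date) -> int:
    """Days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def to_date64(d: date) -> int:
    """Milliseconds since 1970-01-01, always a whole number of days."""
    return to_date32(d) * MS_PER_DAY


def to_timestamp_us(text: str) -> int:
    """Microseconds since the epoch for an ISO 8601 string with a UTC offset."""
    moment = datetime.fromisoformat(text)
    if moment.tzinfo is None:
        raise ValueError("expected a UTC offset such as +09:00")
    return (moment - EPOCH_UTC) // timedelta(microseconds=1)


print(to_date32(date(2025, 1, 15)))
print(to_date64(date(2025, 1, 15)))
print(to_timestamp_us("2025-01-15T14:30:00+09:00"))
```

Both date types represent the same calendar day, which is why their examples in the table look identical; they differ only in the stored integer.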

Complex Types

Type	Description
`list`	Array of elements of the same type
`struct`	Structure with multiple fields
`binary`	Binary data
`large_binary`	Large binary data

Next Steps

Code Management — Managing code artifacts in collections
Version Control — Resource version management system
Adding Items — Adding datasets to collections
