Version: v0.1.0

Datasets

Datasets are the core resource for managing structured table data in D.Hub. Internally, they use the Delta Lake format, which provides schema management, version tracking, and transaction support.

Dataset Overview

Datasets are one of the primary item types included in collections, with the following characteristics:

  • Schema-Based Management: Clearly defines each column's name, data type, and description.
  • Version Control: A new version is automatically recorded whenever data changes.
  • Delta Lake-Based: Supports ACID transactions, Time Travel, and Schema Evolution.
  • S3-Compatible Storage: Data is stored on object storage.

Dataset Types

D.Hub supports two types of datasets.

Type | Description
Delta Dataset | General table dataset based on file upload and pipelines
Kafka Dataset | Streaming dataset that ingests data from real-time message streams

Dataset Detail Screen

When you select a dataset in the collection tree, detailed information is displayed in the right panel.

Schema Tab

The Schema tab lets you view and edit the dataset's structure.

  • Column List: Each column's name, data type, and description are displayed in table format.
  • Type Information: Precise Arrow-based data types are shown (e.g., int64, string, timestamp).
  • Nullable: Shows whether each column allows NULL values.
  • Schema Editing: In edit mode, you can modify column names, types, and descriptions.
Info: The schema is automatically inferred during CSV upload and can be manually modified if needed.

Data Tab

The Data tab lets you view and analyze the actual stored data in table format.

  • Pagination: Browse large volumes of data page by page.
  • SQL Scratch Pad: A collapsible built-in SQL editor allows you to write queries directly to search and filter data.
  • Map View: When geographic (latitude/longitude) data is present, a map visualization view is automatically activated.
  • Chart View: When time series data is detected, a chart visualization view is automatically activated.
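
As an illustration of the kind of query the SQL Scratch Pad is meant for, the sketch below filters rows and pages through the result. It runs against an in-memory SQLite database purely as a stand-in: the table name `sensor_readings`, its columns, and the use of `LIMIT`/`OFFSET` for paging are assumptions, and the SQL dialect that D.Hub accepts may differ.

```python
import sqlite3

# Stand-in for a dataset; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (id INTEGER, city TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    [(i, "Seoul" if i % 2 == 0 else "Busan", 10.0 + i) for i in range(10)],
)


def fetch_page(city, page, page_size=2):
    """Return one page of rows for a city, ordered by id (page is 0-based)."""
    query = (
        "SELECT id, city, temperature FROM sensor_readings "
        "WHERE city = ? ORDER BY id LIMIT ? OFFSET ?"
    )
    return conn.execute(query, (city, page_size, page * page_size)).fetchall()


first_page = fetch_page("Seoul", page=0)
second_page = fetch_page("Seoul", page=1)
```

Ordering by a stable column before paging keeps page boundaries consistent between requests.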

Versions Tab

View the dataset's version history in a timeline format.

  • Version List: Displays each version's modification time and whether it's the latest version.
  • Version Preview: Select a specific version to view data at that point in time.
  • Version Restore: Restore a previous version as the current version.
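
The relationship between version preview and version restore can be pictured with a toy in-memory model. This is only a sketch: the `VersionedTable` class is hypothetical, and it assumes restore behaves as it does in Delta Lake, where restoring appends a new version that copies an earlier snapshot instead of deleting history.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Row = Dict[str, object]


@dataclass
class VersionedTable:
    """Toy append-only version history (hypothetical, not D.Hub's API)."""

    snapshots: List[List[Row]] = field(default_factory=list)

    def write(self, rows: List[Row]) -> int:
        """Record a new version and return its number."""
        self.snapshots.append([dict(r) for r in rows])
        return len(self.snapshots) - 1

    def preview(self, version: int) -> List[Row]:
        """Read the data as it was at the given version."""
        return [dict(r) for r in self.snapshots[version]]

    def restore(self, version: int) -> int:
        """Make an old snapshot current by appending it as a new version."""
        return self.write(self.snapshots[version])

    @property
    def latest(self) -> int:
        """Number of the most recent version."""
        return len(self.snapshots) - 1


table = VersionedTable()
v0 = table.write([{"id": 1, "city": "Seoul"}])
v1 = table.write([{"id": 1, "city": "Seoul"}, {"id": 2, "city": "Busan"}])
v2 = table.restore(v0)  # v1 stays in the history; v2 has the same rows as v0
```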

Metadata

View and edit the dataset's metadata at the top of the detail screen.

Field | Description
Name | Unique identifier for the dataset (lowercase letters, numbers, underscores)
Alias | Display name shown to users
Category | Resource classification (default: dataset)
Type | Dataset type (user-defined)
Tags | Tag list for search and classification
Comment | Description of the dataset
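
The character rule for the Name field can be checked before a dataset is created. The sketch below is a minimal client-side check, assuming the rule is exactly "lowercase ASCII letters, digits, and underscores"; the function name is hypothetical, and any length limit or first-character rule that D.Hub may enforce is not covered.

```python
import re

# Character set taken from the Name field description. Length limits and
# first-character rules are not stated in the documentation, so none are checked.
NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_dataset_name(name: str) -> bool:
    """True if name uses only lowercase ASCII letters, digits, and underscores."""
    return NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_dataset_name("air_quality_2025")` passes, while `"Air Quality"` fails because of the uppercase letters and the space.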

CSV Upload

D.Hub supports convenient data upload via CSV files.

Upload Procedure

  1. Select File: Click the Upload button on the dataset detail screen and select a CSV file.
  2. Automatic Schema Inference: The system analyzes the CSV file's headers and data to automatically infer each column's data type.
  3. Data Type Review: Review the inferred schema and modify types if needed.
  4. Execute Upload: After confirming the schema, click the Upload button. The data is converted and stored as a Delta Lake table.
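
To make the inference step more concrete, here is a rough sketch of how column types could be guessed from CSV text. It is not D.Hub's actual algorithm: the function names, the candidate order (int64, float64, boolean, timestamp, string), and the treatment of empty cells as missing values are all assumptions chosen for illustration.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Guess one type name for a column from its string values."""
    non_empty = [v for v in values if v != ""]  # empty cells are treated as NULL
    if not non_empty:
        return "string"

    def all_parse(parser):
        """True if every non-empty value is accepted by the parser."""
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int64"
    if all_parse(float):
        return "float64"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "boolean"
    if all_parse(datetime.fromisoformat):
        return "timestamp"
    return "string"


def infer_schema(csv_text):
    """Return {column name: inferred type}, reading the first row as headers."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = [[] for _ in header]
    for row in reader:
        for i, cell in enumerate(row[: len(header)]):
            columns[i].append(cell)
    return {name: infer_type(col) for name, col in zip(header, columns)}


sample = (
    "id,city,temperature,measured_at\n"
    "1,Seoul,3.5,2025-01-15T14:30:00+09:00\n"
    "2,Busan,,2025-01-15T15:00:00+09:00\n"
)
schema = infer_schema(sample)
```

A guess like this can be wrong for columns such as postal codes, where digits should stay strings, which is why the review step that follows inference matters.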
Tip: The first row of the CSV file should contain column headers. If the file has no header row, columns are automatically assigned default names such as column_0 and column_1 during upload.
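
The fallback naming for headerless files can be sketched as follows. The `read_rows` helper and its `has_header` flag are hypothetical; how D.Hub itself decides whether a header row is present is not documented here.

```python
import csv
import io


def read_rows(csv_text, has_header):
    """Return (column_names, data_rows); headerless input gets column_0, column_1, ..."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [], []
    if has_header:
        return rows[0], rows[1:]
    names = [f"column_{i}" for i in range(len(rows[0]))]
    return names, rows


names, data = read_rows("1,Seoul\n2,Busan\n", has_header=False)
```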

Upload Notes

  • File Encoding: UTF-8 encoding is recommended.
  • Delimiter: The delimiter is auto-detected from the file content. Comma (,) is the default.
  • Data Size: Upload time may increase for large files.
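
If you want to check a file's encoding and delimiter before uploading, Python's standard `csv.Sniffer` offers one way to do it. This is a local pre-check sketch, not the detection D.Hub performs; the candidate delimiter set and the fallback to a comma are assumptions.

```python
import csv


def detect_delimiter(raw: bytes, candidates: str = ",;\t|") -> str:
    """Decode as UTF-8 and guess the delimiter, falling back to a comma."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError for non-UTF-8 input
    try:
        return csv.Sniffer().sniff(text, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer could not decide, for example with a single-column file.
        return ","


semicolon_file = "id;city;temp\n1;Seoul;3.5\n2;Busan;4.1\n".encode("utf-8")
tab_file = "id\tcity\n1\tSeoul\n2\tBusan\n".encode("utf-8")
```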
Warning: Closing the browser or navigating away during an upload may interrupt it. Wait for the upload completion message before performing other tasks.

Supported Data Types

D.Hub datasets support a wide range of Apache Arrow-based data types.

Basic Types

Type | Description | Example
string | String (UTF-8) | "Hello", "Seoul"
large_string | Large string | Long text data
boolean | True/False | true, false

Integer Types

Type | Range | Description
int8 | -128 to 127 | 8-bit integer
int16 | -32,768 to 32,767 | 16-bit integer
int32 | -2^31 to 2^31-1 | 32-bit integer
int64 | -2^63 to 2^63-1 | 64-bit integer
uint8 | 0 to 255 | Unsigned 8-bit integer
uint16 | 0 to 65,535 | Unsigned 16-bit integer
uint32 | 0 to 2^32-1 | Unsigned 32-bit integer
uint64 | 0 to 2^64-1 | Unsigned 64-bit integer
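
The ranges in the table follow directly from the bit width: a signed n-bit integer covers -2^(n-1) to 2^(n-1)-1, and an unsigned one covers 0 to 2^n-1. The sketch below computes those bounds and picks the narrowest type that fits a set of values. The helper names and the preference order (narrower first, signed before unsigned at the same width) are illustrative choices, not D.Hub behavior.

```python
# (type name, bit width, signed) in the order the picker tries them.
INTEGER_TYPES = [
    ("int8", 8, True), ("uint8", 8, False),
    ("int16", 16, True), ("uint16", 16, False),
    ("int32", 32, True), ("uint32", 32, False),
    ("int64", 64, True), ("uint64", 64, False),
]


def int_range(bits, signed):
    """Return (min, max) for an integer type of the given width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def smallest_int_type(values):
    """Return the first type whose range covers all values, or None if none does."""
    lo, hi = min(values), max(values)
    for name, bits, signed in INTEGER_TYPES:
        type_min, type_max = int_range(bits, signed)
        if type_min <= lo and hi <= type_max:
            return name
    return None
```

For instance, values between 0 and 200 fit `uint8`, while a column containing both -1 and 40,000 needs `int32`.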

Floating-Point Types

Type | Precision | Description
float16 | Half precision | 16-bit floating point
float32 | Single precision | 32-bit floating point
float64 | Double precision | 64-bit floating point
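
The practical difference between the three precisions shows up when a value is stored and read back. The sketch below uses Python's `struct` module, whose `e`, `f`, and `d` format codes correspond to IEEE 754 half, single, and double precision, to round-trip the value 0.1. It only illustrates rounding error and says nothing about how D.Hub stores values internally.

```python
import struct


def round_trip(value, fmt):
    """Pack a Python float into the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


as_float16 = round_trip(0.1, "<e")  # half precision
as_float32 = round_trip(0.1, "<f")  # single precision
as_float64 = round_trip(0.1, "<d")  # double precision, same as a Python float

error16 = abs(as_float16 - 0.1)
error32 = abs(as_float32 - 0.1)
```

The half-precision error is several orders of magnitude larger than the single-precision one, so float16 is best kept for data where a few significant digits are enough.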

Date/Time Types

Type | Description | Example
date32 | Date (day unit) | 2025-01-15
date64 | Date (millisecond unit) | 2025-01-15
time32 | Time (seconds/milliseconds) | 14:30:00
time64 | Time (microseconds/nanoseconds) | 14:30:00.123456
timestamp | Timestamp (timezone support) | 2025-01-15T14:30:00+09:00
duration | Duration | Elapsed time
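
In Arrow, date32 stores a date as the number of days since 1970-01-01 and date64 stores it as milliseconds since the same epoch, so both examples in the table are the same calendar day in different units. The sketch below computes those raw values for the table's examples. The helper names are hypothetical, and microseconds are assumed as the timestamp unit purely for illustration, since Arrow timestamps can also use seconds, milliseconds, or nanoseconds.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_date32(d):
    """Days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def to_date64(d):
    """Milliseconds since 1970-01-01 (a whole number of days)."""
    return to_date32(d) * 86_400_000


def to_timestamp_us(text):
    """Microseconds since the UTC epoch for an ISO 8601 string with a UTC offset."""
    dt = datetime.fromisoformat(text)  # must include an offset such as +09:00
    return (dt - EPOCH_UTC) // timedelta(microseconds=1)


days = to_date32(date(2025, 1, 15))
millis = to_date64(date(2025, 1, 15))
micros = to_timestamp_us("2025-01-15T14:30:00+09:00")
```

The timestamp example carries a +09:00 offset, so its stored instant corresponds to 05:30:00 UTC on the same day.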

Complex Types

Type | Description
list | Array of elements of the same type
struct | Structure with multiple fields
binary | Binary data
large_binary | Large binary data

Next Steps