
Web Crawling

D.Hub Knowledge's web crawling feature automatically collects website content starting from a specified URL, splits it into chunks, and indexes it into the Knowledge store. It is useful for building knowledge from various web sources such as technical documentation sites, blogs, and wikis.

The crawling engine is an AI-based web crawler that supports both static HTML pages and dynamic, JavaScript-rendered pages (SPAs). Collected content is split according to the configured chunking strategy and then indexed into the selected store: Vector DB, Text Search DB, or Graph DB.


Creating a New Web Crawling Job

From the Documents tab on the Knowledge detail screen, select the Web source to navigate to the web crawling job creation page (/knowledge/:knowledgeId/documents/web/new).

Basic Information

| Item | Required | Description |
| --- | --- | --- |
| Start URL | Required | The URL of the web page to start crawling from (e.g., https://docs.example.com) |
| Document Title | Optional | Auto-detected from the page title if not provided |
| Description | Optional | Notes about the crawling job |

Crawling Options

After entering basic information, expand the Crawl Options section to fine-tune the crawling behavior.

Crawling Strategy

| Strategy | Description | Best For |
| --- | --- | --- |
| BFS (Breadth-First) | Collects all pages at the current depth before moving to the next depth | Evenly covering the overall site structure (default) |
| DFS (Depth-First) | Explores one path to the end before moving to the next | Intensively collecting a specific subsection |
| BEST_FIRST (Relevance-First) | Prioritizes pages with higher link-importance scores | Efficiently collecting related content |
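
As a rough illustration (not the product's implementation), the three strategies differ only in how the crawl frontier is ordered: BFS pops from a FIFO queue, DFS from a LIFO stack, and BEST_FIRST from a priority queue keyed by a relevance score. The toy link graph and scores below are invented for the example; the sketch also enforces the Max Depth / Max Pages limits described in the next section.

```python
from heapq import heappush, heappop

# Invented toy site: page -> outgoing links, plus a per-page relevance
# score standing in for the link-importance score BEST_FIRST uses.
GRAPH = {
    "/": ["/a", "/b"], "/a": ["/a/1", "/a/2"], "/b": ["/b/1"],
    "/a/1": [], "/a/2": [], "/b/1": [],
}
RELEVANCE = {"/": 1.0, "/a": 0.9, "/a/1": 0.8, "/b": 0.3, "/a/2": 0.2, "/b/1": 0.1}

def crawl(strategy, start="/", max_depth=3, max_pages=100):
    visited, order = set(), []
    queue = [(start, 0)]                    # BFS/DFS frontier: (url, depth)
    heap = [(-RELEVANCE[start], start, 0)]  # BEST_FIRST frontier (max-heap)
    while len(order) < max_pages and (heap if strategy == "BEST_FIRST" else queue):
        if strategy == "BEST_FIRST":
            _, url, depth = heappop(heap)
        else:  # BFS pops the oldest entry (FIFO); DFS pops the newest (LIFO)
            url, depth = queue.pop(0) if strategy == "BFS" else queue.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        for link in GRAPH[url]:
            queue.append((link, depth + 1))
            heappush(heap, (-RELEVANCE[link], link, depth + 1))
    return order

print(crawl("BFS"))         # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(crawl("DFS"))         # ['/', '/b', '/b/1', '/a', '/a/2', '/a/1']
print(crawl("BEST_FIRST"))  # ['/', '/a', '/a/1', '/b', '/a/2', '/b/1']
```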

Exploration Scope

| Option | Default | Range | Description |
| --- | --- | --- | --- |
| Max Depth | 3 | 1~10 | Link exploration depth from the start URL |
| Max Pages | 100 | 1~10,000 | Maximum number of pages to collect |

Max Pages Setting

If the maximum page count is not set, crawling may take an extremely long time. When crawling large sites, be sure to set an appropriate limit.

Additional Options

| Option | Default | Description |
| --- | --- | --- |
| Use Sitemap | Enabled | Quickly collects the URL list from sitemap.xml |
| Respect robots.txt | Enabled | Checks robots.txt rules and skips blocked paths |
| Exclude External Links | Enabled | Ignores links to domains other than the start URL's |
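
The Respect robots.txt behavior can be approximated with Python's standard library, as sketched below; the crawler's actual implementation is not documented here, and the user agent string is an assumption.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://docs.example.com/robots.txt")
rp.read()                                  # fetch and parse the rules

user_agent = "DHubKnowledgeBot"            # hypothetical crawler user agent
for url in ("https://docs.example.com/guide/intro",
            "https://docs.example.com/admin/settings"):
    if rp.can_fetch(user_agent, url):
        print("fetch:", url)
    else:
        print("skip (blocked by robots.txt):", url)

# robots.txt often also advertises the sitemap, which is what the
# "Use Sitemap" option exploits to collect URLs quickly:
print(rp.site_maps())  # e.g. ['https://docs.example.com/sitemap.xml']
```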

Content Processing Options

Options for extracting only the useful content from crawled web pages. These correspond to the content options in the options reference and are exposed as advanced settings in the UI.

| Option | Description | Default |
| --- | --- | --- |
| Execute JS (execute_js) | Executes JavaScript to collect content from dynamically rendered pages | Enabled |
| Content Filter (content_filter) | Automatically filters out unneeded content using the BM25 or Pruning algorithm | PRUNING |
| CSS Selector (css_selector) | Extracts only specific elements from the page (e.g., main, article, .content) | Not set |
| Excluded Tags (excluded_tags) | HTML tags to exclude from extraction (e.g., nav, footer, aside) | Not set |

JS Execution Option Tip

Enable the JS execution option only for SPAs or sites with dynamic content; on static sites it wastes resources.
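
To make the effect of css_selector and excluded_tags concrete, here is a minimal sketch using BeautifulSoup as a stand-in for whatever HTML processing the crawler performs internally:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <nav>Site navigation</nav>
  <main><h1>Guide</h1><p>Useful content.</p><aside>Ads</aside></main>
  <footer>Copyright</footer>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

main = soup.select_one("main")         # css_selector: keep only this element
for tag in main.find_all(["aside"]):   # excluded_tags: drop nav/footer/aside etc.
    tag.decompose()

print(main.get_text(" ", strip=True))  # -> "Guide Useful content."
```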


Indexing Options

Select the search mode for storing the collected content.

| Search Mode | Description |
| --- | --- |
| VECTOR (Semantic Only) | Embedding-based semantic similarity search only |
| TEXT (Keyword Only) | BM25-based keyword full-text search only |
| HYBRID (Recommended) | Combines Vector + Text results with an RRF-based merge |

Tip

In most cases, HYBRID mode is most effective. It leverages the advantages of both semantic search and keyword search.
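
For reference, Reciprocal Rank Fusion (RRF) scores each document by summing 1/(k + rank) over every result list it appears in, so items ranked well by both searches rise to the top. A minimal sketch, assuming the commonly used constant k=60 (the product's constant is not documented):

```python
def rrf_merge(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]      # semantic similarity order
text_hits   = ["doc1", "doc9", "doc3"]      # BM25 keyword order
print(rrf_merge([vector_hits, text_hits]))  # ['doc1', 'doc3', 'doc9', 'doc7']
```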


Chunking Options

Settings for splitting crawled content into search-appropriate sizes. These can be adjusted in the Chunking Options section.

| Option | Default | Range | Description |
| --- | --- | --- | --- |
| Chunking Strategy | markdown | - | Splitting method (hybrid, markdown, hierarchical, fixed, parent_child) |
| Max Chunk Length | 500 | 100~4,000 | Maximum number of characters in a single chunk |
| Overlap Length | 50 | 0~500 | Number of overlapping characters between adjacent chunks |

The default chunking strategy for web crawling is markdown, which splits content naturally along the markdown structure (headings, lists, etc.) produced when the crawled HTML is converted to markdown.

Recommended Overlap Length

An overlap length of 10~25% of the max chunk length is recommended. For example, if the max chunk length is 500, an overlap of 50~125 is appropriate.
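
A simplified sketch of heading-based splitting with character overlap is shown below. The parameter names mirror the UI options; the product's markdown splitter is more sophisticated (it also respects lists and other structure).

```python
import re

def chunk_markdown(text, max_len=500, overlap=50):
    # Split before each markdown heading, then window any oversized section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        start = 0
        while start < len(section):
            chunks.append(section[start:start + max_len])
            if start + max_len >= len(section):
                break
            start += max_len - overlap  # step back by `overlap` characters
    return chunks

doc = "# Intro\n" + "A" * 120 + "\n## Details\n" + "B" * 40
for c in chunk_markdown(doc, max_len=100, overlap=20):
    print(len(c), repr(c[:30]))
```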


Crawling Job Monitoring

Once a crawling job starts, you can monitor progress in real-time on the job detail page (/knowledge/:knowledgeId/documents/web/:jobId).

Job Status

| Status | Description |
| --- | --- |
| PROCESSING | Crawling is in progress |
| PAUSED | Paused by the user |
| COMPLETED | Crawling and indexing completed successfully |
| FAILED | Crawling failed due to an error |
| CANCELLED | Cancelled by the user |

Control Buttons by Status

Different control buttons are displayed in the header area depending on the job status.

| Current Status | Available Buttons | Action |
| --- | --- | --- |
| PROCESSING | Cancel | Immediately stops the crawl in progress |
| COMPLETED | Re-crawl | Restarts crawling with the same settings |
| FAILED | Re-crawl | Retries the failed crawl |

Progress Display

While crawling is in progress, an Indexing Status Card is displayed at the top of the detail page. This card includes the current status and progress message, refreshing automatically at 5-second intervals.
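
If you prefer to watch a job from a script rather than the UI, a polling loop like the one below mirrors the page's 5-second refresh. The endpoint path and response fields here are assumptions modeled on the UI route, not a documented API.

```python
import time
import requests

def wait_for_job(base_url, knowledge_id, job_id, interval=5):
    # Hypothetical status endpoint; the real API path and response
    # schema are not documented on this page.
    url = f"{base_url}/knowledge/{knowledge_id}/documents/web/{job_id}"
    while True:
        status = requests.get(url, timeout=10).json().get("status")
        print("status:", status)
        if status in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        time.sleep(interval)  # same cadence as the UI's auto-refresh
```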

Once crawling is complete, the status card can be dismissed, and the generated chunk list is displayed in table format.

Chunk List

Chunks generated after crawling completion can be viewed in a paginated table.

| Column | Description |
| --- | --- |
| # | Chunk sequence number |
| Type | Chunk type (TEXT, IMAGE) |
| Content | Chunk content preview (2 lines) |

Click a row in the table to open a Chunk Detail Drawer, where you can view the full content and navigate to previous/next chunks.


Document Management

On the job detail page, you can perform the following actions on crawling result documents.

  • Edit Description: Add or modify the description of the document
  • Delete Document: Permanently delete the document and all associated chunks and embeddings

Deletion Warning

Deleting a document permanently removes all its chunks and embedding data. This action cannot be undone.


Complete Crawling Workflow

  1. Knowledge Detail → Documents Tab → Select Web source
  2. Enter start URL and configure crawling options
  3. Adjust chunking/indexing options if needed
  4. Click Start Crawl to begin crawling
  5. Monitor progress on the job detail page
  6. After completion, review generated chunks and verify quality with Search Test

Recommended Settings for Efficient Crawling
  • Technical documentation sites: BFS strategy, max depth 3~5, external link exclusion enabled
  • Blogs/Wikis: BEST_FIRST strategy, max page limit, CSS selector to target the main content area
  • SPAs (React/Vue-based): JS execution enabled, consider infinite scroll handling
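
Putting the recommendations for a technical documentation site together, a job configuration might look like the sketch below. The snake_case keys follow the option names on this page; the overall payload shape is an assumption.

```python
# Illustrative configuration only; field names outside the documented
# snake_case options (execute_js, content_filter, css_selector,
# excluded_tags) are hypothetical.
job_config = {
    "start_url": "https://docs.example.com",
    "crawl_strategy": "BFS",        # even coverage of the site structure
    "max_depth": 4,                 # recommended 3~5 for docs sites
    "max_pages": 500,
    "exclude_external_links": True,
    "use_sitemap": True,
    "respect_robots_txt": True,
    "execute_js": False,            # enable only for SPA / dynamic sites
    "content_filter": "PRUNING",
    "css_selector": "main",         # target the main content area
    "excluded_tags": ["nav", "footer", "aside"],
    "search_mode": "HYBRID",        # Vector + Text with RRF merge
    "chunking": {"strategy": "markdown", "max_chunk_length": 500, "overlap": 50},
}
```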

Next Steps