
Web Crawling

D.Hub Knowledge's web crawling feature automatically collects website content starting from a specified URL, splits it into chunks, and indexes it into the Knowledge store. It is useful for building knowledge from various web sources such as technical documentation sites, blogs, and wikis.

The crawling engine is an AI-based web crawler that supports both static HTML pages and dynamic, JavaScript-rendered pages (SPAs). Collected content is split according to the configured chunking strategy and then indexed into the selected store: Vector DB, Text Search DB, or Graph DB.


Creating a New Web Crawling Job

From the Documents tab on the Knowledge detail screen, select the Web source to navigate to the web crawling job creation page (/knowledge/:knowledgeId/documents/web/new).

Basic Information

| Item | Required | Description |
| --- | --- | --- |
| Start URL | Required | The URL of the web page to start crawling from (e.g., https://docs.example.com) |
| Document Title | Optional | Auto-detected from the page title if not provided |
| Description | Optional | Notes about the crawling job |

Crawling Options

After entering basic information, expand the Crawl Options section to fine-tune the crawling behavior.

Crawling Strategy

| Strategy | Description | Best For |
| --- | --- | --- |
| BFS (Breadth-First) | Collects all pages at the current depth before moving to the next depth | Evenly covering the overall site structure (default) |
| DFS (Depth-First) | Explores one path to the end before moving to the next | Intensively collecting a specific subsection |
| BEST_FIRST (Relevance-First) | Prioritizes pages with higher link-importance scores | Efficiently collecting related content |
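
As a rough illustration (not the product's implementation), the three strategies differ only in how the crawl frontier is ordered: BFS pops from a FIFO queue, DFS from a LIFO stack, and BEST_FIRST from a priority queue keyed by a relevance score. The toy link graph and scores below are invented for the example; the sketch also enforces the Max Depth / Max Pages limits described in the next section.

```python
from heapq import heappush, heappop

# Invented toy site: page -> outgoing links, plus a per-page relevance
# score standing in for the link-importance score BEST_FIRST uses.
GRAPH = {
    "/": ["/a", "/b"], "/a": ["/a/1", "/a/2"], "/b": ["/b/1"],
    "/a/1": [], "/a/2": [], "/b/1": [],
}
RELEVANCE = {"/": 1.0, "/a": 0.9, "/a/1": 0.8, "/b": 0.3, "/a/2": 0.2, "/b/1": 0.1}

def crawl(strategy, start="/", max_depth=3, max_pages=100):
    visited, order = set(), []
    queue = [(start, 0)]                    # BFS/DFS frontier: (url, depth)
    heap = [(-RELEVANCE[start], start, 0)]  # BEST_FIRST frontier (max-heap)
    while len(order) < max_pages and (heap if strategy == "BEST_FIRST" else queue):
        if strategy == "BEST_FIRST":
            _, url, depth = heappop(heap)
        else:  # BFS pops the oldest entry (FIFO); DFS pops the newest (LIFO)
            url, depth = queue.pop(0) if strategy == "BFS" else queue.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        for link in GRAPH[url]:
            queue.append((link, depth + 1))
            heappush(heap, (-RELEVANCE[link], link, depth + 1))
    return order

print(crawl("BFS"))         # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(crawl("DFS"))         # ['/', '/b', '/b/1', '/a', '/a/2', '/a/1']
print(crawl("BEST_FIRST"))  # ['/', '/a', '/a/1', '/b', '/a/2', '/b/1']
```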

Exploration Scope

| Option | Default | Range | Description |
| --- | --- | --- | --- |
| Max Depth | 3 | 1~10 | Link exploration depth from the start URL |
| Max Pages | 100 | 1~10,000 | Maximum number of pages to collect |

Max Pages Setting

If the maximum page count is not set, crawling may take an extremely long time. When crawling large sites, be sure to set an appropriate limit.

Additional Options

| Option | Default | Description |
| --- | --- | --- |
| Use Sitemap | Enabled | Quickly collects the URL list from sitemap.xml |
| Respect robots.txt | Enabled | Checks robots.txt rules and skips blocked paths |
| Exclude External Links | Enabled | Ignores links to domains other than the start URL's |
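
The Respect robots.txt behavior can be approximated with Python's standard library, as sketched below; the crawler's actual implementation is not documented here, and the user agent string is an assumption.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://docs.example.com/robots.txt")
rp.read()                                  # fetch and parse the rules

user_agent = "DHubKnowledgeBot"            # hypothetical crawler user agent
for url in ("https://docs.example.com/guide/intro",
            "https://docs.example.com/admin/settings"):
    if rp.can_fetch(user_agent, url):
        print("fetch:", url)
    else:
        print("skip (blocked by robots.txt):", url)

# robots.txt often also advertises the sitemap, which is what the
# "Use Sitemap" option exploits to collect URLs quickly:
print(rp.site_maps())  # e.g. ['https://docs.example.com/sitemap.xml']
```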

Content Processing Options

Options for extracting only the useful content from crawled web pages. These correspond to the content options in the options reference and are exposed as advanced settings in the UI.

| Option | Description | Default |
| --- | --- | --- |
| Execute JS (execute_js) | Executes JavaScript to collect content from dynamically rendered pages | Enabled |
| Content Filter (content_filter) | Automatically filters out unneeded content using the BM25 or Pruning algorithm | PRUNING |
| CSS Selector (css_selector) | Extracts only specific elements from the page (e.g., main, article, .content) | Not set |
| Excluded Tags (excluded_tags) | HTML tags to exclude from extraction (e.g., nav, footer, aside) | Not set |

JS Execution Option Tip

Enable the JS execution option only for SPAs or sites with dynamic content; on static sites it wastes resources.
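
To make the effect of css_selector and excluded_tags concrete, here is a minimal sketch using BeautifulSoup as a stand-in for whatever HTML processing the crawler performs internally:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <nav>Site navigation</nav>
  <main><h1>Guide</h1><p>Useful content.</p><aside>Ads</aside></main>
  <footer>Copyright</footer>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

main = soup.select_one("main")         # css_selector: keep only this element
for tag in main.find_all(["aside"]):   # excluded_tags: drop nav/footer/aside etc.
    tag.decompose()

print(main.get_text(" ", strip=True))  # -> "Guide Useful content."
```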


Indexing Options

Select the search mode for storing the collected content.

| Search Mode | Description |
| --- | --- |
| VECTOR (Semantic Only) | Embedding-based semantic similarity search only |
| TEXT (Keyword Only) | BM25-based keyword full-text search only |
| HYBRID (Recommended) | Combines Vector + Text results with an RRF-based merge |

Tip

In most cases, HYBRID mode is most effective. It leverages the advantages of both semantic search and keyword search.
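
For reference, Reciprocal Rank Fusion (RRF) scores each document by summing 1/(k + rank) over every result list it appears in, so items ranked well by both searches rise to the top. A minimal sketch, assuming the commonly used constant k=60 (the product's constant is not documented):

```python
def rrf_merge(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]      # semantic similarity order
text_hits   = ["doc1", "doc9", "doc3"]      # BM25 keyword order
print(rrf_merge([vector_hits, text_hits]))  # ['doc1', 'doc3', 'doc9', 'doc7']
```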


Chunking Options

Settings for splitting crawled content into search-appropriate sizes. These can be adjusted in the Chunking Options section.

| Option | Default | Range | Description |
| --- | --- | --- | --- |
| Chunking Strategy | markdown | - | Splitting method (hybrid, markdown, hierarchical, fixed, parent_child) |
| Max Chunk Length | 500 | 100~4,000 | Maximum number of characters in a single chunk |
| Overlap Length | 50 | 0~500 | Number of overlapping characters between adjacent chunks |

The default chunking strategy for web crawling is markdown, which splits content naturally along the markdown structure (headings, lists, etc.) produced when the crawled HTML is converted to markdown.

Recommended Overlap Length

An overlap length of 10~25% of the max chunk length is recommended. For example, if the max chunk length is 500, an overlap of 50~125 is appropriate.
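
A simplified sketch of heading-based splitting with character overlap is shown below. The parameter names mirror the UI options; the product's markdown splitter is more sophisticated (it also respects lists and other structure).

```python
import re

def chunk_markdown(text, max_len=500, overlap=50):
    # Split before each markdown heading, then window any oversized section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        start = 0
        while start < len(section):
            chunks.append(section[start:start + max_len])
            if start + max_len >= len(section):
                break
            start += max_len - overlap  # step back by `overlap` characters
    return chunks

doc = "# Intro\n" + "A" * 120 + "\n## Details\n" + "B" * 40
for c in chunk_markdown(doc, max_len=100, overlap=20):
    print(len(c), repr(c[:30]))
```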


Crawling Job Monitoring

Once a crawling job starts, you can monitor progress in real-time on the job detail page (/knowledge/:knowledgeId/documents/web/:jobId).

Job Status

| Status | Description |
| --- | --- |
| PROCESSING | Crawling is in progress |
| PAUSED | Paused by the user |
| COMPLETED | Crawling and indexing completed successfully |
| FAILED | Crawling failed due to an error |
| CANCELLED | Cancelled by the user |

Control Buttons by Status

Different control buttons are displayed in the header area depending on the job status.

| Current Status | Available Buttons | Action |
| --- | --- | --- |
| PROCESSING | Cancel | Immediately stops the crawl in progress |
| COMPLETED | Re-crawl | Restarts crawling with the same settings |
| FAILED | Re-crawl | Retries the failed crawl |

Progress Display

While crawling is in progress, an Indexing Status Card is displayed at the top of the detail page. This card includes the current status and progress message, refreshing automatically at 5-second intervals.
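
If you prefer to watch a job from a script rather than the UI, a polling loop like the one below mirrors the page's 5-second refresh. The endpoint path and response fields here are assumptions modeled on the UI route, not a documented API.

```python
import time
import requests

def wait_for_job(base_url, knowledge_id, job_id, interval=5):
    # Hypothetical status endpoint; the real API path and response
    # schema are not documented on this page.
    url = f"{base_url}/knowledge/{knowledge_id}/documents/web/{job_id}"
    while True:
        status = requests.get(url, timeout=10).json().get("status")
        print("status:", status)
        if status in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        time.sleep(interval)  # same cadence as the UI's auto-refresh
```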

Once crawling is complete, the status card can be dismissed, and the generated chunk list is displayed in table format.

Chunk List

Chunks generated after crawling completion can be viewed in a paginated table.

| Column | Description |
| --- | --- |
| # | Chunk sequence number |
| Type | Chunk type (TEXT, IMAGE) |
| Content | Chunk content preview (2 lines) |

Click a row in the table to open a Chunk Detail Drawer, where you can view the full content and navigate to previous/next chunks.


Document Management

On the job detail page, you can perform the following actions on crawling result documents.

  • Edit Description: Add or modify the description of the document
  • Delete Document: Permanently delete the document and all associated chunks and embeddings

Deletion Warning

Deleting a document permanently removes all its chunks and embedding data. This action cannot be undone.


Complete Crawling Workflow

  1. Knowledge Detail → Documents Tab → Select Web source
  2. Enter start URL and configure crawling options
  3. Adjust chunking/indexing options if needed
  4. Click Start Crawl to begin crawling
  5. Monitor progress on the job detail page
  6. After completion, review generated chunks and verify quality with Search Test

Recommended Settings for Efficient Crawling
  • Technical documentation sites: BFS strategy, max depth 3~5, external link exclusion enabled
  • Blogs/Wikis: BEST_FIRST strategy, max page limit, CSS selector to target the main content area
  • SPAs (React/Vue-based): JS execution enabled, consider infinite scroll handling
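
Putting the recommendations for a technical documentation site together, a job configuration might look like the sketch below. The snake_case keys follow the option names on this page; the overall payload shape is an assumption.

```python
# Illustrative configuration only; field names outside the documented
# snake_case options (execute_js, content_filter, css_selector,
# excluded_tags) are hypothetical.
job_config = {
    "start_url": "https://docs.example.com",
    "crawl_strategy": "BFS",        # even coverage of the site structure
    "max_depth": 4,                 # recommended 3~5 for docs sites
    "max_pages": 500,
    "exclude_external_links": True,
    "use_sitemap": True,
    "respect_robots_txt": True,
    "execute_js": False,            # enable only for SPA / dynamic sites
    "content_filter": "PRUNING",
    "css_selector": "main",         # target the main content area
    "excluded_tags": ["nav", "footer", "aside"],
    "search_mode": "HYBRID",        # Vector + Text with RRF merge
    "chunking": {"strategy": "markdown", "max_chunk_length": 500, "overlap": 50},
}
```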

Next Steps