Web Crawling
D.Hub Knowledge's web crawling feature automatically collects website content starting from a specified URL, splits it into chunks, and indexes it into the Knowledge store. It is useful for building knowledge from various web sources such as technical documentation sites, blogs, and wikis.
The crawling engine is built on an AI-based web crawler and supports both static HTML pages and JavaScript-rendered dynamic pages (SPAs). Collected content is split according to the configured chunking strategy and then indexed into the selected store: Vector DB, Text Search DB, or Graph DB.
Creating a New Web Crawling Job
From the Documents tab on the Knowledge detail screen, select the Web source to navigate to the web crawling job creation page (/knowledge/:knowledgeId/documents/web/new).
Basic Information
| Item | Required | Description |
|---|---|---|
| Start URL | Required | The URL of the web page to start crawling from (e.g., https://docs.example.com) |
| Document Title | Optional | Auto-detected from the page title if not provided |
| Description | Optional | Notes about the crawling job |
Crawling Options
After entering basic information, expand the Crawl Options section to fine-tune the crawling behavior.
Crawling Strategy
| Strategy | Description | Best For |
|---|---|---|
| BFS (Breadth-First) | Collects pages at the same depth first before moving to the next depth | Evenly collecting the overall site structure (default) |
| DFS (Depth-First) | Explores one path to the end before moving to the next | Intensively collecting a specific subsection |
| BEST_FIRST (Relevance-First) | Prioritizes collecting pages with higher relevance based on link importance scores | Efficiently collecting related content |
Exploration Scope
| Option | Default | Range | Description |
|---|---|---|---|
| Max Depth | 3 | 1~10 | Link exploration depth from the start URL |
| Max Pages | 100 | 1~10,000 | Maximum number of pages to collect |
If the maximum page count is not set, crawling may take an extremely long time. When crawling large sites, be sure to set an appropriate limit.
Additional Options
| Option | Default | Description |
|---|---|---|
| Use Sitemap | Enabled | Quickly collect URL lists using sitemap.xml |
| Respect robots.txt | Enabled | Check robots.txt rules and skip blocked paths |
| Exclude External Links | Enabled | Ignore links to domains different from the start URL |
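As a rough illustration, the sketch below shows how a BFS frontier interacts with the depth and page limits and with external-link exclusion. It is a simplified stand-in for explanation only, not the actual D.Hub Knowledge crawler, and the fetching and link-extraction logic is deliberately naive.

```python
# Simplified BFS crawl sketch; not the actual D.Hub Knowledge crawler.
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests

def crawl_bfs(start_url, max_depth=3, max_pages=100, exclude_external=True):
    start_host = urlparse(start_url).netloc
    frontier = deque([(start_url, 0)])            # (url, depth); BFS uses a FIFO queue
    visited, pages = set(), []

    while frontier and len(pages) < max_pages:    # "Max Pages" limit
        url, depth = frontier.popleft()           # a DFS variant would pop() from the end instead
        if url in visited or depth > max_depth:   # "Max Depth" limit
            continue
        visited.add(url)

        html = requests.get(url, timeout=10).text
        pages.append((url, html))

        for href in re.findall(r'href="([^"#]+)"', html):   # naive link extraction
            link = urljoin(url, href)
            if exclude_external and urlparse(link).netloc != start_host:
                continue                          # "Exclude External Links" behavior
            frontier.append((link, depth + 1))
    return pages
```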
Content Processing Options
Options for extracting only the useful content from crawled web pages. These correspond to the content options in the options reference and are exposed as advanced settings in the UI.
| Option | Description | Default |
|---|---|---|
| Execute JS (execute_js) | Collect content from dynamically rendered pages by executing JavaScript | Enabled |
| Content Filter (content_filter) | Automatically filter unnecessary content using BM25 or Pruning algorithms | PRUNING |
| CSS Selector (css_selector) | Extract only specific elements from the page (e.g., main, article, .content) | Not set |
| Excluded Tags (excluded_tags) | List of HTML tags to exclude from extraction (e.g., nav, footer, aside) | Not set |
Enable the JS execution option only for SPAs or sites with dynamically rendered content; on static sites it only wastes resources.
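For illustration, the content options above might be combined as in the following sketch. The option keys come from the table, but the surrounding configuration object is an assumption, not the exact D.Hub Knowledge payload format.

```python
# Illustrative content-processing settings; the dict structure is assumed.
content_options = {
    "execute_js": True,                           # render JavaScript first (SPAs/dynamic pages only)
    "content_filter": "PRUNING",                  # or "BM25" to filter by keyword relevance
    "css_selector": "main, article",              # extract only the main content area
    "excluded_tags": ["nav", "footer", "aside"],  # strip navigation and boilerplate tags
}
```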
Indexing Options
Select the search mode to use when indexing and searching the collected content.
| Search Mode | Description |
|---|---|
| VECTOR (Semantic Only) | Uses only embedding vector-based semantic similarity search |
| TEXT (Keyword Only) | Uses only BM25-based keyword full-text search |
| HYBRID (Recommended) | Hybrid search combining Vector + Text (RRF-based merge) |
In most cases, HYBRID mode is the most effective, since it combines the strengths of semantic search and keyword search.
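For reference, the RRF-based merge used by HYBRID mode can be sketched as follows. This is a simplified stand-in for the actual merge logic, and the constant k = 60 is a commonly used default assumed here.

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch for merging two ranked lists.
def rrf_merge(vector_hits, text_hits, k=60):
    """Each argument is a list of chunk IDs ordered best-first."""
    scores = {}
    for hits in (vector_hits, text_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Chunks ranked well by both searches accumulate the highest scores.
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(["c3", "c1", "c7"], ["c1", "c9", "c3"])  # -> c1 and c3 rise to the top
```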
Chunking Options
Settings for splitting crawled content into search-appropriate sizes. These can be adjusted in the Chunking Options section.
| Option | Default | Range | Description |
|---|---|---|---|
| Chunking Strategy | markdown | - | Select splitting method (hybrid, markdown, hierarchical, fixed, parent_child) |
| Max Chunk Length | 500 | 100~4,000 | Maximum number of characters in a single chunk |
| Overlap Length | 50 | 0~500 | Number of overlapping characters between adjacent chunks |
The default chunking strategy for web crawling is set to markdown, which naturally splits based on the markdown structure (headings, lists, etc.) converted from HTML.
An overlap length of 10~25% of the max chunk length is recommended. For example, if the max chunk length is 500, an overlap of 50~125 is appropriate.
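The sketch below shows how max chunk length and overlap interact in a simple fixed-length splitter. The actual markdown strategy additionally splits on structural boundaries such as headings and lists rather than raw character counts.

```python
# Simplified fixed-length chunker illustrating max length and overlap.
def chunk_text(text, max_len=500, overlap=50):
    # Assumes overlap < max_len, which the UI ranges guarantee.
    chunks, start = [], 0
    step = max_len - overlap          # each new chunk re-includes the last `overlap` characters
    while start < len(text):
        chunks.append(text[start:start + max_len])
        start += step
    return chunks
```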
Crawling Job Monitoring
Once a crawling job starts, you can monitor progress in real-time on the job detail page (/knowledge/:knowledgeId/documents/web/:jobId).
Job Status
| Status | Description |
|---|---|
| PROCESSING | Crawling is in progress |
| PAUSED | Paused by user |
| COMPLETED | Crawling and indexing completed successfully |
| FAILED | Crawling failed due to an error |
| CANCELLED | Cancelled by user |
Control Buttons by Status
Different control buttons are displayed in the header area depending on the job status.
| Current Status | Available Buttons | Action |
|---|---|---|
| PROCESSING | Cancel | Immediately stop the crawling in progress |
| COMPLETED | Re-crawl | Restart crawling with the same settings |
| FAILED | Re-crawl | Retry the failed crawling |
Progress Display
While crawling is in progress, an Indexing Status Card is displayed at the top of the detail page. The card shows the current status and a progress message and refreshes automatically every 5 seconds.
Once crawling is complete, the status card can be dismissed, and the generated chunk list is displayed in table format.
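If you prefer to watch a job from a script rather than the UI, a polling loop like the following mirrors the card's 5-second refresh. The endpoint URL and the response's "status" field are assumptions for illustration, not a documented D.Hub Knowledge API.

```python
# Hypothetical polling loop; endpoint and response shape are assumed.
import time
import requests

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}

def wait_for_crawl_job(api_base: str, job_id: str, interval: int = 5) -> str:
    while True:
        resp = requests.get(f"{api_base}/web-crawl-jobs/{job_id}")  # hypothetical endpoint
        status = resp.json().get("status", "PROCESSING")
        print("crawl job status:", status)
        if status in TERMINAL:
            return status
        time.sleep(interval)
```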
Chunk List
Chunks generated after crawling completion can be viewed in a paginated table.
| Column | Description |
|---|---|
| # | Chunk sequence number |
| Type | Chunk type (TEXT, IMAGE) |
| Content | Chunk content preview (2 lines) |
Click a row in the table to open a Chunk Detail Drawer, where you can view the full content and navigate to previous/next chunks.
Document Management
On the job detail page, you can perform the following actions on crawling result documents.
- Edit Description: Add or modify the description of the document
- Delete Document: Permanently delete the document and all associated chunks and embeddings
Deleting a document permanently removes all its chunks and embedding data. This action cannot be undone.
Complete Crawling Workflow
- Knowledge Detail → Documents Tab → Select Web source
- Enter start URL and configure crawling options
- Adjust chunking/indexing options if needed
- Click Start Crawl to begin crawling
- Monitor progress on the job detail page
- After completion, review generated chunks and verify quality with Search Test
Recommended Settings by Site Type
- Technical documentation sites: BFS strategy, max depth 3~5, external link exclusion enabled (see the example below)
- Blogs/Wikis: BEST_FIRST strategy, max page limit, CSS selector to target the main content area
- SPAs (React/Vue-based): JS execution enabled, consider infinite scroll handling
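As an example, a job for a technical documentation site might combine the recommendations above roughly as follows. Only the content option keys (execute_js, css_selector, excluded_tags) are named in this guide; the other field names are assumptions made for the sake of the example.

```python
# Illustrative settings for a technical documentation site; several keys are assumed.
docs_site_job = {
    "start_url": "https://docs.example.com",
    "strategy": "BFS",                            # assumed key; evenly cover the site structure
    "max_depth": 4,                               # within the recommended 3~5
    "max_pages": 500,
    "exclude_external_links": True,               # assumed key
    "execute_js": False,                          # static documentation pages rarely need JS rendering
    "css_selector": "main",                       # target the main content area
    "excluded_tags": ["nav", "footer", "aside"],
}
```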
Next Steps
- File Upload — Upload PDF, DOCX, and other document files
- Manual Documents — Write text chunks directly
- Search Test — Verify search quality of collected knowledge
- Chunking and Options — Detailed reference for chunking strategies and embedding models