Ingest Data from Content Sources

Content Sources enable Search AI to ingest data from websites, documents, and third-party applications, creating a unified knowledge base for generating accurate answers. When content is ingested, Search AI automatically initiates training to create the answer index using configured extraction strategies.

Websites

Crawl and index web pages from any public or authenticated website. Navigate to Content > Websites to manage web sources.

Adding a Web Crawler

Click +Web Crawl and configure the following:

Field	Description
Source Title	Unique name for the web source
Description	Information about the source and its purpose
Crawl Source	URL, Upload Sitemap (CSV), or Upload URL (CSV)
Crawl Depth	Levels to traverse (1–10, default: 5)
Max URL Limit	Maximum URLs to crawl (1–10,000, default: 10)
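
How Crawl Depth and Max URL Limit interact can be pictured with a small breadth-first traversal. This is only an illustrative sketch over an in-memory link graph, not Search AI's crawler: the function name crawl_order, the links mapping, and the choice to count the starting URL as level 1 are all assumptions made for the example.

```python
from collections import deque


def crawl_order(start, links, crawl_depth=5, max_urls=10):
    """Return URLs in the order a breadth-first crawl would visit them.

    `links` is a hypothetical mapping of URL -> list of URLs found on that page.
    Defaults mirror the documented defaults (depth 5, 10 URLs).
    """
    seen = {start}
    order = []
    queue = deque([(start, 1)])  # assumption: the start URL is level 1
    while queue and len(order) < max_urls:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= crawl_depth:
            continue  # do not follow links beyond the configured depth
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
print(crawl_order("a", site, crawl_depth=2))  # depth stops the crawl
print(crawl_order("a", site, max_urls=2))     # URL limit stops the crawl
```

Whichever limit is reached first ends the traversal, so a low Max URL Limit can stop a crawl well before the configured depth is exhausted.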

Advanced Crawl Options

Option	Purpose
Use Cookies	Crawl pages requiring cookie acceptance
JavaScript Rendered	Capture dynamically rendered content (set crawl delay for full rendering)
Crawl Beyond Sitemap	Include URLs not listed in the sitemap
Respect robots.txt	Honor crawler directives in robots.txt
Automatic Cleaning	Remove headers, footers, and head tags before ingestion
Retain Original	Ingest complete HTML content for custom transformation via Workbench

URL Filtering

Control which pages are crawled using rules:

Crawl everything — All discovered URLs
Crawl everything except — Block URLs matching specified conditions
Crawl only specific URLs — Allow only URLs matching specified conditions

Conditions include: equals, not equal, contains, does not contain, begins with, ends with.
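
The three rule modes and six conditions above can be sketched as a small predicate. This is a hedged illustration rather than the product's implementation: the mode strings, the should_crawl function, and the assumption that multiple conditions combine with OR are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Condition names come from the list above; the matching semantics are assumed.
CONDITIONS: dict[str, Callable[[str, str], bool]] = {
    "equals": lambda url, value: url == value,
    "not equal": lambda url, value: url != value,
    "contains": lambda url, value: value in url,
    "does not contain": lambda url, value: value not in url,
    "begins with": lambda url, value: url.startswith(value),
    "ends with": lambda url, value: url.endswith(value),
}

Rule = Tuple[str, str]  # (condition name, value)


def matches_any(url: str, rules: Iterable[Rule]) -> bool:
    """True if the URL satisfies at least one rule (assumed OR semantics)."""
    return any(CONDITIONS[cond](url, value) for cond, value in rules)


def should_crawl(url: str, mode: str, rules: Iterable[Rule] = ()) -> bool:
    """Decide whether a discovered URL is crawled under the given mode."""
    if mode == "everything":
        return True
    if mode == "everything_except":
        return not matches_any(url, rules)
    if mode == "only_specific":
        return matches_any(url, rules)
    raise ValueError(f"unknown mode: {mode}")


rules = [("contains", "/blog/"), ("ends with", ".pdf")]
print(should_crawl("https://example.com/blog/post", "everything_except", rules))
print(should_crawl("https://example.com/guide.pdf", "only_specific", rules))
```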

Authentication

Search AI supports two authentication methods for protected websites: Basic HTTP Authentication and Form-based Authentication.

Basic HTTP Authentication

Username/email and password
Optional authorization fields (headers, payload, query string, path parameters)
Authentication URL (may differ from source URL)
Test type: Text Presence, Redirection, or Status Code

Form-based Authentication

Form fields with key, type, and value
Session maintained after initial authentication
Same test options as Basic Auth
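
The three test types can be thought of as different ways of confirming that a login attempt worked. The sketch below is an assumption about their meaning, not documented behavior: the AuthResponse class, the test-type strings, and the exact comparisons are hypothetical and operate on an already-captured response rather than a live request.

```python
from dataclasses import dataclass


@dataclass
class AuthResponse:
    """Hypothetical summary of the response returned after authenticating."""
    status_code: int
    final_url: str
    body: str


def auth_succeeded(resp: AuthResponse, test_type: str, expected: str) -> bool:
    """Evaluate one of the three test types against a captured response."""
    if test_type == "text_presence":
        # Assumed: success when the expected text appears in the page body.
        return expected in resp.body
    if test_type == "redirection":
        # Assumed: success when the login lands on the expected URL.
        return resp.final_url == expected
    if test_type == "status_code":
        # Assumed: success when the HTTP status equals the expected code.
        return resp.status_code == int(expected)
    raise ValueError(f"unknown test type: {test_type}")


resp = AuthResponse(200, "https://example.com/home", "Welcome back, Sign out")
print(auth_succeeded(resp, "text_presence", "Sign out"))
print(auth_succeeded(resp, "status_code", "200"))
```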

Managing Web Sources

Action	How To
View crawled pages	Go to source > Pages tab (shows Successful, Failed, Skipped)
View execution logs	Go to source > Executions tab
Recrawl entire source	Click Recrawl action or use Re-Crawl button on configuration page
Recrawl specific page	Pages tab > Actions > Recrawl
Schedule recrawl	Set date, time, and frequency in crawl configuration
Delete source	Click Delete action (removes all indexed content)

Troubleshooting

Issue	Potential Causes	Resolution
Crawl failure	Permissions, credentials, invalid URL, domain not whitelisted	Verify access, credentials, and URL format
Crawl succeeds but no indexing	JS-rendered pages, incorrect include/exclude rules, undiscoverable pages	Enable JavaScript Rendered, check rules, verify sitemap
Crawl takes too long	Large content volume, JS pages returning no content	Adjust max URLs, enable JavaScript Rendered

Documents

Upload and index files directly to Search AI. Navigate to Content > Documents to manage uploaded files.

Supported Formats

PDF, DOCX, PPT, and TXT. Scanned PDFs and password-protected files are not supported.

Upload Options

Method	Description
File	Upload one or more files to an existing directory
URL	Upload a file from a remote URL with title and description
Directory	Upload a complete folder from local device

Upload Limits

Limit	Default Value
File size	15 MB maximum per file
Batch upload	40 files at once
Directory upload	20 files maximum

Contact support to increase these limits.
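
Checking a batch against the default limits before uploading can avoid rejected files. The following is a minimal local pre-check, assuming that 1 MB means 1024 * 1024 bytes and that format is judged by file extension; check_batch and its input shape are hypothetical and do not call any Search AI API.

```python
import os

MAX_FILE_BYTES = 15 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_BATCH_FILES = 40               # default batch upload limit
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".ppt", ".txt"}


def check_batch(files):
    """Return a list of problems for a batch of (file name, size in bytes)."""
    problems = []
    if len(files) > MAX_BATCH_FILES:
        problems.append(
            f"batch has {len(files)} files; limit is {MAX_BATCH_FILES}"
        )
    for name, size in files:
        ext = os.path.splitext(name)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {ext or '(none)'}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: larger than 15 MB")
    return problems


batch = [("guide.pdf", 2_000_000), ("data.xlsx", 10_000)]
for problem in check_batch(batch):
    print(problem)
```

If the limits have been raised by support, the constants would need to be adjusted to match.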

Managing Documents

View files: Click any directory to see its files with type and page count
View file details: Click a file for preview, metadata, and JSON view
Delete file: Actions > Delete (from directory or file details page)
Delete directory: Actions > Delete (removes directory and all files)
Search: Use search bar on Directory page or within directory details

Connectors

Pre-built integrations for 60+ third-party applications. Navigate to Content > Connectors to configure integrations.

Connector Setup Workflow

Authentication — Provide OAuth credentials, API keys, or tokens
Ingestion — Select content types and apply filters
Field Mapping — Align source fields with Search AI schema (optional)
Permissions — Configure RACL settings

Configuration Options

Content Types: Most connectors support multiple content types (pages, articles, tasks, tickets, documents). Select which types to ingest under the Ingestion section.
Filters: Some connectors support selective ingestion by timeframe, category, user assignment, or other criteria.
Field Mapping: Customize how source fields map to Search AI’s schema using custom scripts for data transformation.

Sync Operations

Type	Description
Manual sync	Click Sync Now on Configuration tab
Scheduled sync	Set automatic intervals (files >15MB are skipped)
Permission sync	Separate scheduler for RACL updates

Managing Connectors

Enable/Disable: Toggle connector without deleting data (disables sync temporarily)
View content: Content tab lists ingested items with URLs and timestamps
Remove connector: Connection Settings > Remove this content source (deletes all indexed data)

Custom Connector

For applications without a pre-built connector, the Custom Connector uses REST APIs via a middleware service.

Setup Steps:

Download the reference implementation from Search AI
Configure config.json with application details and content fields
Set authentication in .env file
Host the service and configure endpoint in Search AI
Add headers (Authorization required for default implementation)

The service processes content in batches of 30 documents. RACL support is available via the sys_racl field in the config.
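
Batching in groups of 30 can be illustrated with a short generator. This is a generic sketch of the batching idea only, not code from the reference implementation; the batched helper and the sample document IDs are hypothetical.

```python
from itertools import islice

BATCH_SIZE = 30  # batch size stated for the service


def batched(items, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


documents = [f"doc-{i}" for i in range(65)]  # hypothetical document IDs
for number, batch in enumerate(batched(documents), start=1):
    print(f"batch {number}: {len(batch)} documents")
```

The final batch holds whatever remains, so a source with 65 documents produces two full batches and one partial batch.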

JSON Connector

Upload structured data in JSON format for indexing.

Upload up to 10 files at once
Maximum 15 MB per file
Add to existing source or create new

Supported Connectors

Search AI provides out-of-the-box support for ingesting data from a range of third-party repositories. If you want to use a repository not listed below, contact us.

RACL column values:

Auto: permissions are synced automatically from the source
Manual: RACL is available, but permission entities must be managed manually
No: RACL is not supported

Connector	Repository	Supported Content	Filtering	RACL
Aha!	Cloud	Ideas, Features	No	Auto
Airtable	Cloud	Spreadsheets, Databases	No	Manual
Amazon S3	Cloud	.pdf, .ppt, .txt, .docx	No	No
Asana	Cloud	Projects, Tasks	No	Auto
Axero	Cloud	Pages, Wiki, Discussions, Documents, Articles, Announcements, Blogs	Yes	Auto
Azure Storage	Cloud	.txt, .pdf, .rtf, .doc, .docx, .ppt, .pptx	No	Manual
BitBucket	Cloud	Pull Requests	No	Auto
Box	Cloud	.pdf, .docx, .txt	No	Auto
Clickup	Cloud	Tasks, Subtasks, Checklists	No	Manual
Coda Docs	Cloud	Pages	No	Manual
Confluence Cloud	Cloud	Knowledge Articles	Yes	Auto
Confluence Server	On-prem	Knowledge Articles	Yes	Auto
Custom Connector	Cloud	—	No	Manual
Datadog	Cloud	Metrics, Dashboards, Monitors	Yes	Manual
DotCMS	Cloud	—	Yes	Manual
Dropbox	Cloud	.doc, .docx, .ppt, .pptx, .pdf, .txt, .html	No	Manual
Egnyte	Cloud	.doc, .docx, .ppt, .pptx, .pdf, .txt	No	Manual
Figma	Cloud	Figma Files	No	No
FreshDesk	Cloud	Tickets	No	Auto
Freshservice	Cloud	Solution Articles	No	Manual
Front	Cloud	Knowledge Base Articles	No	Manual
GitHub	Cloud	Issues, README, Pull Requests	Yes	Auto
GitHub On-Prem	On-prem	Issues, Pull Requests, Pages, Files, Commits	Yes	Manual
GitLab	Cloud	Issues	No	Manual
Google Drive	Cloud	.doc, .docx, .ppt, .pptx, .pdf, .txt, .html	Yes	Auto
Guru	Cloud	Cards	No	Auto
HelpScout	Cloud	Articles	No	Auto
Hive	Cloud	Actions, Sub-actions	No	Auto
HubSpot	Cloud	Tickets, Deals, Companies, Contacts	No	Auto
Jenkins	Cloud	Dashboard, Jobs, Builds, Plugins	No	No
JFrog Artifactory	Cloud	Artifacts	No	Auto
Jira	Cloud	Issues, Filters, Dashboards	No	Auto
Jira On-Prem	On-prem	Work Items	Yes	Auto
JSON Connector	—	Structured JSON Data	No	No
LumApps ¹	Cloud	Pages, News, Custom Objects, Community Posts	Yes	Auto
Miro	Cloud	Boards	No	Auto
Monday	Cloud	Board Items	No	Manual
MS Teams	Cloud	Channel Conversations	No	Auto
Notion	Cloud	Pages	No	Manual
OneDrive	Cloud	.aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf	No	Auto
Opsgenie	Cloud	Alerts, Incidents	No	Manual
Oracle Knowledge	Cloud	Knowledge Articles	No	No
PagerDuty	Cloud	Schedules, Escalation Policies	No	Auto
Re:amaze	Cloud	Articles	No	Auto
Salesforce	Cloud	Knowledge Articles	Yes	No
ServiceNow	Cloud	Incidents, Service Catalog, Knowledge Articles	Yes	Auto
SharePoint	Cloud	.aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf	Yes	Auto
Shopify	Cloud	Articles, Product Catalog	No	Manual
Shortcut	Cloud	Stories	No	Auto
Slab	Cloud	Posts	No	Manual
Slack	Cloud	Messages	No	Auto
Teamwork	Cloud	Tasks	No	Manual
TestRail	Cloud	Test Cases	No	Manual
Trello	Cloud	Boards, Cards	No	Auto
WordPress	Cloud	Pages, Posts	No	Manual
Workday	Cloud	HR Org Charts, Employee Details	No	Auto
Wrike	Cloud	Tasks	No	Manual
xMatters	Cloud	Incidents, Events, On-Calls	No	Auto
YouTrack	Cloud	Projects, Issues, Knowledge Articles	No	Auto
Zendesk	Cloud	Knowledge Articles, Tickets	No	Auto
Zeplin	Cloud	Screens	No	Auto
Zoho	Cloud	Leads, Accounts, Contacts, Deals	No	Manual
Zoom	Cloud	Meeting Summaries	Yes	Auto
Zulip	Cloud	Messages	No	Auto

¹ LumApps supports attachment ingestion.

For any connectors not listed above, use the Custom Connector or contact Kore.ai.

Role-based Access Control (RACL)

RACL ensures users only see content they’re authorized to access by synchronizing permissions from source applications.

How RACL Works

Ingestion: Search AI imports permissions (users, groups, criteria) along with content
Storage: Access information stored in the sys_racl field of each chunk
Access Control: Only users whose identities appear in sys_racl can view answers from that content

Configuring RACL

Go to the Permissions page of the connector and select one of the following:

Option	Behavior
Same users as in source	Syncs permissions from source application (Restricted Access)
Everyone (Public Access)	Content accessible to all Search AI users (sets `sys_racl` to `*`)
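
The access rule described in this section (identities listed in sys_racl, or * for public content) can be sketched as a simple check. The chunk shape is an assumption: sys_racl is treated here as either a string or a list of strings, and a chunk with no sys_racl is treated as not viewable. The can_view function is hypothetical.

```python
def can_view(chunk, user_identities):
    """Return True if any of the user's identities is allowed by sys_racl."""
    racl = chunk.get("sys_racl") or []
    if isinstance(racl, str):
        racl = [racl]  # assumption: a bare string means a single entry
    allowed = set(racl)
    if "*" in allowed:
        return True  # public access: everyone can view
    return bool(allowed & set(user_identities))


public_chunk = {"sys_racl": "*"}
restricted_chunk = {"sys_racl": ["alice@example.com", "engineering"]}

print(can_view(public_chunk, {"bob@example.com"}))
print(can_view(restricted_chunk, {"bob@example.com"}))
print(can_view(restricted_chunk, {"alice@example.com"}))
```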

Permission Entities

Permission entities represent groups or user criteria from source applications. Search AI handles these in two ways:

Automatic Resolution (supported by some connectors):

Fetches group membership data from source
Maintains up-to-date user-group mappings
Associates users with permission entities automatically

Manual Resolution (other connectors):

Use Permission Entity APIs to manage user associations
Required for connectors without automatic resolution support
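
Conceptually, resolving permission entities means expanding a user into the set of identities that may appear in sys_racl. The sketch below models that with an in-memory user-group mapping; it does not use the Permission Entity APIs, and the effective_identities function and the sample entities are hypothetical.

```python
def effective_identities(user, entity_members):
    """Return the user's own identity plus every entity they belong to.

    `entity_members` is a hypothetical mapping of entity name -> set of users,
    standing in for the user-group mappings kept for each connector.
    """
    entities = {
        entity for entity, members in entity_members.items() if user in members
    }
    return {user} | entities


entity_members = {
    "engineering": {"alice@example.com", "bob@example.com"},
    "hr": {"carol@example.com"},
}

print(sorted(effective_identities("alice@example.com", entity_members)))
print(sorted(effective_identities("dave@example.com", entity_members)))
```

With automatic resolution this mapping is maintained from source data; with manual resolution it has to be kept current by whoever manages the entities.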

Verifying Permissions

Go to Content page
Open JSON view for any file
Check the sys_racl field contents

The same information appears in individual chunks via Content Browser.

Permission Sync Scheduling

Access permissions often change more frequently than content. Configure separate sync schedules:

Open Permissions section of connector
Enable Permissions Sync Scheduled
Set time and frequency

The Sync Scope column shows “Permission Sync” for RACL-specific operations.

RACL Limitations

Supported for connector content only (not websites or uploaded documents)
Switching from Restricted to Public Access automatically disables RACL

Quick Reference

Task	Location
Add website	Content > Websites > +Web Crawl
Upload files	Content > Documents > Upload
Add connector	Content > Connectors > Select connector
View indexed content	Index > Content Browser
Configure extraction	Index > Extraction
Test answers	Configuration > Testing
