Websites
Crawl and index web pages from any public or authenticated website. Navigate to Content > Websites to manage web sources.
Adding a Web Crawler
Click +Web Crawl and configure the following:
| Field | Description |
|---|---|
| Source Title | Unique name for the web source |
| Description | Information about the source and its purpose |
| Crawl Source | URL, Upload Sitemap (CSV), or Upload URL (CSV) |
| Crawl Depth | Levels to traverse (1–10, default: 5) |
| Max URL Limit | Maximum URLs to crawl (1–10,000, default: 10) |
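
The ranges and defaults in the table above can be checked before a crawl is submitted. The following is a minimal sketch, not Search AI's own validation: the `CrawlConfig` class and its field names are hypothetical and only mirror the form fields.

```python
from dataclasses import dataclass

# Ranges taken from the field table above.
CRAWL_DEPTH_RANGE = (1, 10)
MAX_URL_RANGE = (1, 10_000)


@dataclass
class CrawlConfig:
    """Hypothetical container mirroring the Web Crawl form fields."""
    source_title: str
    crawl_source: str
    crawl_depth: int = 5      # documented default
    max_url_limit: int = 10   # documented default

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means all values are in range."""
        problems = []
        low, high = CRAWL_DEPTH_RANGE
        if not low <= self.crawl_depth <= high:
            problems.append(f"Crawl Depth must be between {low} and {high}")
        low, high = MAX_URL_RANGE
        if not low <= self.max_url_limit <= high:
            problems.append(f"Max URL Limit must be between {low} and {high}")
        return problems
```

With the defaults, `CrawlConfig("Docs", "https://example.com").validate()` returns an empty list.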
Advanced Crawl Options
| Option | Purpose |
|---|---|
| Use Cookies | Crawl pages requiring cookie acceptance |
| JavaScript Rendered | Capture dynamically rendered content (set crawl delay for full rendering) |
| Crawl Beyond Sitemap | Include URLs not listed in the sitemap |
| Respect robots.txt | Honor crawler directives in robots.txt |
| Automatic Cleaning | Remove headers, footers, and head tags before ingestion |
| Retain Original | Ingest complete HTML content for custom transformation via Workbench |
URL Filtering
Control which pages are crawled using rules:
- Crawl everything: All discovered URLs
- Crawl everything except: Block URLs matching specified conditions
- Crawl only specific URLs: Allow only URLs matching specified conditions
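
The three rule modes above amount to an allow-all, a blocklist, and an allowlist. The sketch below illustrates that logic only. The mode names (`all`, `except`, `only`) and the use of shell-style wildcard patterns are assumptions; the conditions Search AI accepts may be expressed differently.

```python
from fnmatch import fnmatchcase


def should_crawl(url: str, mode: str, patterns: list[str]) -> bool:
    """Decide whether a discovered URL is crawled under one of the three rule modes.

    mode: "all" (crawl everything), "except" (block matches), "only" (allow matches).
    patterns: shell-style wildcards such as "*/blog/*" (an assumed condition syntax).
    """
    matched = any(fnmatchcase(url, pattern) for pattern in patterns)
    if mode == "all":
        return True
    if mode == "except":
        return not matched
    if mode == "only":
        return matched
    raise ValueError(f"unknown mode: {mode}")
```

For example, `should_crawl("https://example.com/blog/a", "except", ["*/blog/*"])` is false, while the same URL is crawled under `"all"`.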
Authentication
Search AI supports two authentication methods for protected websites.
Basic HTTP Authentication
- Username/email and password
- Optional authorization fields (headers, payload, query string, path parameters)
- Authentication URL (may differ from source URL)
- Test type: Text Presence, Redirection, or Status Code
Form-based Authentication
- Form fields with key, type, and value
- Session maintained after initial authentication
- Same test options as Basic Auth
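
The three test types can be thought of as different checks applied to the response from the authentication URL. The sketch below shows one plausible reading of each. The `AuthResponse` class, the test-type keys, and the exact comparison rules are assumptions and do not describe how Search AI evaluates the tests internally.

```python
from dataclasses import dataclass


@dataclass
class AuthResponse:
    """Hypothetical summary of the response returned after an authentication attempt."""
    status_code: int
    final_url: str
    body: str


def auth_succeeded(response: AuthResponse, test_type: str, expected: str) -> bool:
    """Apply one of the three test types to decide whether authentication worked."""
    if test_type == "text_presence":
        # Success if a known piece of text appears on the page after login.
        return expected in response.body
    if test_type == "redirection":
        # Success if the login flow ends on the expected URL.
        return response.final_url == expected
    if test_type == "status_code":
        # Success if the response carries the expected HTTP status.
        return response.status_code == int(expected)
    raise ValueError(f"unknown test type: {test_type}")
```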
Managing Web Sources
| Action | How To |
|---|---|
| View crawled pages | Go to source > Pages tab (shows Successful, Failed, Skipped) |
| View execution logs | Go to source > Executions tab |
| Recrawl entire source | Click Recrawl action or use Re-Crawl button on configuration page |
| Recrawl specific page | Pages tab > Actions > Recrawl |
| Schedule recrawl | Set date, time, and frequency in crawl configuration |
| Delete source | Click Delete action (removes all indexed content) |
Troubleshooting
| Issue | Potential Causes | Resolution |
|---|---|---|
| Crawl failure | Permissions, credentials, invalid URL, domain not whitelisted | Verify access, credentials, and URL format |
| Crawl succeeds but no indexing | JS-rendered pages, incorrect include/exclude rules, undiscoverable pages | Enable JavaScript Rendered, check rules, verify sitemap |
| Crawl takes too long | Large content volume, JS pages returning no content | Adjust max URLs, enable JavaScript Rendered |
Documents
Upload and index files directly to Search AI. Navigate to Content > Documents to manage uploaded files.
Supported Formats
PDF, DOCX, PPT, TXT (scanned PDFs and password-protected files are not supported)
Upload Options
| Method | Description |
|---|---|
| File | Upload one or more files to an existing directory |
| URL | Upload a file from a remote URL with title and description |
| Directory | Upload a complete folder from local device |
Upload Limits
| Limit | Default Value |
|---|---|
| File size | 15 MB maximum per file |
| Batch upload | 40 files at once |
| Directory upload | 20 files maximum |
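
A batch can be checked against these limits before uploading. This is a minimal local pre-check sketch and not part of Search AI. The function name is hypothetical, 15 MB is interpreted as 15 × 1024 × 1024 bytes (an assumption), and the values are the defaults listed above, which may differ in a given workspace.

```python
from pathlib import PurePath

MAX_FILE_BYTES = 15 * 1024 * 1024  # 15 MB per file (binary megabytes assumed)
MAX_BATCH_FILES = 40               # files per batch upload
MAX_DIRECTORY_FILES = 20           # files per directory upload
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".ppt", ".txt"}


def check_upload(files: dict[str, int], directory_upload: bool = False) -> list[str]:
    """Check a mapping of file name to size in bytes; return a list of problems."""
    problems = []
    limit = MAX_DIRECTORY_FILES if directory_upload else MAX_BATCH_FILES
    if len(files) > limit:
        problems.append(f"{len(files)} files exceeds the limit of {limit}")
    for name, size in files.items():
        if PurePath(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported format")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: larger than 15 MB")
    return problems
```

The check cannot detect scanned or password-protected PDFs, which are also unsupported.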
Managing Documents
- View files: Click any directory to see its files with type and page count
- View file details: Click a file for preview, metadata, and JSON view
- Delete file: Actions > Delete (from directory or file details page)
- Delete directory: Actions > Delete (removes directory and all files)
- Search: Use search bar on Directory page or within directory details
Connectors
Pre-built integrations for 60+ third-party applications. Navigate to Content > Connectors to configure integrations.
Connector Setup Workflow
- Authentication — Provide OAuth credentials, API keys, or tokens
- Ingestion — Select content types and apply filters
- Field Mapping — Align source fields with Search AI schema (optional)
- Permissions — Configure RACL settings
Configuration Options
- Content Types: Most connectors support multiple content types (pages, articles, tasks, tickets, documents). Select which types to ingest under the Ingestion section.
- Filters: Some connectors support selective ingestion by timeframe, category, user assignment, or other criteria.
- Field Mapping: Customize how source fields map to Search AI’s schema using custom scripts for data transformation.
Sync Operations
| Type | Description |
|---|---|
| Manual sync | Click Sync Now on Configuration tab |
| Scheduled sync | Set automatic intervals (files >15MB are skipped) |
| Permission sync | Separate scheduler for RACL updates |
Managing Connectors
- Enable/Disable: Toggle connector without deleting data (disables sync temporarily)
- View content: Content tab lists ingested items with URLs and timestamps
- Remove connector: Connection Settings > Remove this content source (deletes all indexed data)
Custom Connector
For applications without a pre-built connector, the Custom Connector uses REST APIs via a middleware service.
Setup Steps:
- Download the reference implementation from Search AI
- Configure config.json with application details and content fields
- Set authentication in the .env file
- Host the service and configure the endpoint in Search AI
- Add headers (Authorization required for default implementation)
sys_racl field in config.
JSON Connector
Upload structured data in JSON format for indexing.
- Upload up to 10 files at once
- Maximum 15 MB per file
- Add to existing source or create new
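
When a dataset is larger than these limits allow, it can be split locally before upload. The sketch below packs records into JSON array strings that each stay under the size limit, then groups the resulting files into uploads of at most 10. Both function names are hypothetical, the array-of-objects layout is an assumption about the expected JSON structure, and 15 MB is interpreted as 15 × 1024 × 1024 bytes.

```python
import json

MAX_FILE_BYTES = 15 * 1024 * 1024  # 15 MB per file (binary megabytes assumed)
MAX_FILES_PER_UPLOAD = 10


def split_records(records: list[dict], max_bytes: int = MAX_FILE_BYTES) -> list[str]:
    """Greedily pack records into JSON array strings of at most max_bytes each."""
    files: list[str] = []
    current: list[str] = []
    current_size = 2  # the enclosing "[" and "]"
    for record in records:
        encoded = json.dumps(record)
        size = len(encoded.encode("utf-8"))
        if size + 2 > max_bytes:
            raise ValueError("a single record exceeds the file size limit")
        extra = size + (1 if current else 0)  # one byte for the "," separator
        if current_size + extra > max_bytes:
            files.append("[" + ",".join(current) + "]")
            current, current_size, extra = [], 2, size
        current.append(encoded)
        current_size += extra
    if current:
        files.append("[" + ",".join(current) + "]")
    return files


def upload_batches(files: list, batch_size: int = MAX_FILES_PER_UPLOAD) -> list[list]:
    """Group files into consecutive batches of at most batch_size."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
```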
Supported Connectors
Search AI provides out-of-the-box support for ingesting data from a range of third-party repositories. If you want to use a repository not listed below, contact us.
RACL column values: Auto (permissions synced automatically from source), Manual (RACL available, manual entity management required), No (not supported).
| Connector | Repository | Supported Content | Filtering | RACL |
|---|---|---|---|---|
| Aha! | Cloud | Ideas, Features | No | Auto |
| Airtable | Cloud | Spreadsheets, Databases | No | Manual |
| Amazon S3 | Cloud | .pdf, .ppt, .txt, .docx | No | No |
| Asana | Cloud | Projects, Tasks | No | Auto |
| Axero | Cloud | Pages, Wiki, Discussions, Documents, Articles, Announcements, Blogs | Yes | Auto |
| Azure Storage | Cloud | .txt, .pdf, .rtf, .doc, .docx, .ppt, .pptx | No | Manual |
| BitBucket | Cloud | Pull Requests | No | Auto |
| Box | Cloud | .pdf, .docx, .txt | No | Auto |
| Clickup | Cloud | Tasks, Subtasks, Checklists | No | Manual |
| Coda Docs | Cloud | Pages | No | Manual |
| Confluence Cloud | Cloud | Knowledge Articles | Yes | Auto |
| Confluence Server | On-prem | Knowledge Articles | Yes | Auto |
| Custom Connector | Cloud | — | No | Manual |
| Datadog | Cloud | Metrics, Dashboards, Monitors | Yes | Manual |
| DotCMS | Cloud | — | Yes | Manual |
| Dropbox | Cloud | .doc, .docx, .ppt, .pptx, .pdf, .txt, .html | No | Manual |
| Egnyte | Cloud | .doc, .docx, .ppt, .pptx, .pdf, .txt | No | Manual |
| Figma | Cloud | Figma Files | No | No |
| FreshDesk | Cloud | Tickets | No | Auto |
| Freshservice | Cloud | Solution Articles | No | Manual |
| Front | Cloud | Knowledge Base Articles | No | Manual |
| GitHub | Cloud | Issues, README, Pull Requests | Yes | Auto |
| GitHub On-Prem | On-prem | Issues, Pull Requests, Pages, Files, Commits | Yes | Manual |
| GitLab | Cloud | Issues | No | Manual |
| Google Drive | Cloud | .doc, .docx, .ppt, .pptx, .pdf, .txt, .html | Yes | Auto |
| Guru | Cloud | Cards | No | Auto |
| HelpScout | Cloud | Articles | No | Auto |
| Hive | Cloud | Actions, Sub-actions | No | Auto |
| HubSpot | Cloud | Tickets, Deals, Companies, Contacts | No | Auto |
| Jenkins | Cloud | Dashboard, Jobs, Builds, Plugins | No | No |
| JFrog Artifactory | Cloud | Artifacts | No | Auto |
| Jira | Cloud | Issues, Filters, Dashboards | No | Auto |
| Jira On-Prem | On-prem | Work Items | Yes | Auto |
| JSON Connector | — | Structured JSON Data | No | No |
| LumApps ¹ | Cloud | Pages, News, Custom Objects, Community Posts | Yes | Auto |
| Miro | Cloud | Boards | No | Auto |
| Monday | Cloud | Board Items | No | Manual |
| MS Teams | Cloud | Channel Conversations | No | Auto |
| Notion | Cloud | Pages | No | Manual |
| OneDrive | Cloud | .aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf | No | Auto |
| Opsgenie | Cloud | Alerts, Incidents | No | Manual |
| Oracle Knowledge | Cloud | Knowledge Articles | No | No |
| PagerDuty | Cloud | Schedules, Escalation Policies | No | Auto |
| Re:amaze | Cloud | Articles | No | Auto |
| Salesforce | Cloud | Knowledge Articles | Yes | No |
| ServiceNow | Cloud | Incidents, Service Catalog, Knowledge Articles | Yes | Auto |
| SharePoint | Cloud | .aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf | Yes | Auto |
| Shopify | Cloud | Articles, Product Catalog | No | Manual |
| Shortcut | Cloud | Stories | No | Auto |
| Slab | Cloud | Posts | No | Manual |
| Slack | Cloud | Messages | No | Auto |
| Teamwork | Cloud | Tasks | No | Manual |
| TestRail | Cloud | Test Cases | No | Manual |
| Trello | Cloud | Boards, Cards | No | Auto |
| WordPress | Cloud | Pages, Posts | No | Manual |
| Workday | Cloud | HR Org Charts, Employee Details | No | Auto |
| Wrike | Cloud | Tasks | No | Manual |
| xMatters | Cloud | Incidents, Events, On-Calls | No | Auto |
| YouTrack | Cloud | Projects, Issues, Knowledge Articles | No | Auto |
| Zendesk | Cloud | Knowledge Articles, Tickets | No | Auto |
| Zeplin | Cloud | Screens | No | Auto |
| Zoho | Cloud | Leads, Accounts, Contacts, Deals | No | Manual |
| Zoom | Cloud | Meeting Summaries | Yes | Auto |
| Zulip | Cloud | Messages | No | Auto |
Role-based Access Control (RACL)
RACL ensures users only see content they’re authorized to access by synchronizing permissions from source applications.
How RACL Works
- Ingestion: Search AI imports permissions (users, groups, criteria) along with content
- Storage: Access information is stored in the sys_racl field of each chunk
- Access Control: Only users whose identities appear in sys_racl can view answers from that content
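
The access-control step can be pictured as a membership test against each chunk's sys_racl field. The sketch below is illustrative only. It assumes sys_racl holds a list of identity strings (or the single value `*` for public content) and that a chunk with no sys_racl value is not shown; the actual storage format and matching rules may differ.

```python
PUBLIC = "*"  # value used for content open to everyone


def can_view(chunk: dict, user_identities: set[str]) -> bool:
    """Return True if any of the user's identities appears in the chunk's sys_racl."""
    racl = chunk.get("sys_racl", [])
    if isinstance(racl, str):
        racl = [racl]  # tolerate a bare string such as "*"
    return PUBLIC in racl or any(entry in user_identities for entry in racl)


def visible_chunks(chunks: list[dict], user_identities: set[str]) -> list[dict]:
    """Filter retrieved chunks down to those the user is allowed to see."""
    return [chunk for chunk in chunks if can_view(chunk, user_identities)]
```

Filtering at the chunk level means a user without access never receives an answer derived from restricted content, even if that content matches the query.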
Configuring RACL
Go to the Permissions page of the connector and select:
| Option | Behavior |
|---|---|
| Same users as in source | Syncs permissions from source application (Restricted Access) |
| Everyone (Public Access) | Content accessible to all Search AI users (sets sys_racl to *) |
Permission Entities
Permission entities represent groups or user criteria from source applications. Search AI handles these in two ways.
Automatic Resolution (supported by some connectors):
- Fetches group membership data from source
- Maintains up-to-date user-group mappings
- Associates users with permission entities automatically
Manual Resolution:
- Use Permission Entity APIs to manage user associations
- Required for connectors without automatic resolution support
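
Whichever way the user associations are maintained, resolving a permission entity comes down to a mapping from entity to member users. The sketch below shows how such a mapping could be expanded into the set of identities a user holds. The function name and the in-memory dictionary are hypothetical; this does not represent the Permission Entity APIs.

```python
def identities_for(user: str, entity_members: dict[str, set[str]]) -> set[str]:
    """Return the user's own identity plus every permission entity that lists the user."""
    identities = {user}
    for entity, members in entity_members.items():
        if user in members:
            identities.add(entity)
    return identities


# Hypothetical user-group mapping synced from a source application.
entity_members = {
    "engineering": {"ana@example.com", "bob@example.com"},
    "hr": {"carol@example.com"},
}
ana_identities = identities_for("ana@example.com", entity_members)
```

Here `ana_identities` contains `ana@example.com` and `engineering`, so content restricted to the engineering entity would be visible to that user.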
Verifying Permissions
- Go to Content page
- Open JSON view for any file
- Check the sys_racl field contents
Permission Sync Scheduling
Access permissions often change more frequently than content. Configure separate sync schedules:
- Open Permissions section of connector
- Enable Permissions Sync Scheduled
- Set time and frequency
RACL Limitations
- Supported for connector content only (not websites or uploaded documents)
- Switching from Restricted to Public Access automatically disables RACL
Quick Reference
| Task | Location |
|---|---|
| Add website | Content > Websites > +Web Crawl |
| Upload files | Content > Documents > Upload |
| Add connector | Content > Connectors > Select connector |
| View indexed content | Index > Content Browser |
| Configure extraction | Index > Extraction |
| Test answers | Configuration > Testing |