Content Extraction
Extract specific content from web pages using natural language instructions. The Extract API uses AI to understand what you want to extract and returns structured data in your preferred format.
Overview
The extract endpoint enables you to:
- Extract specific information from web pages using natural language
- Process multiple pages in a single batch request
- Get structured output in markdown, HTML, or plain text
- Control extraction quality and depth
Endpoint
POST /search/extract

Authentication

Headers:

X-Client-ID: your_client_id
X-Client-Secret: your_client_secret
Content-Type: application/json

Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| pages | array | Yes | Array of pages to extract from |
| options | object | No | Additional extraction options |
pages Array
Each page object supports the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Page URL to extract from |
| details | string[] | Yes | Array of extraction instructions |
| summaryLevel | string | No | Quality: 'low', 'medium', 'intelligent' |
options Object
| Option | Type | Description |
|---|---|---|
| includeMetadata | boolean | Include extraction metadata |
| timeout | number | Request timeout in milliseconds (default: 30000) |
| maxContentLength | number | Maximum content length to extract |
| format | string | Output format: 'markdown', 'html', 'text' (default: 'markdown') |
| extractDepth | string | Extraction depth: 'basic', 'advanced' |
| includeImages | boolean | Include images in extracted content |
| includeFavicon | boolean | Include page favicon |
| maxLength | number | Maximum extraction length |
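Putting the tables above together, a full request body might look like the sketch below; the URL and values are placeholders, and only the options you need have to be set.

const body = {
  pages: [
    {
      url: 'https://example.com/article',                                  // required
      details: ['Extract the article title', 'Extract the author name'],   // required
      summaryLevel: 'medium'                 // optional: 'low' | 'medium' | 'intelligent'
    }
  ],
  options: {
    format: 'markdown',        // 'markdown' (default) | 'html' | 'text'
    extractDepth: 'basic',     // 'basic' | 'advanced'
    timeout: 30000,            // milliseconds (default: 30000)
    includeMetadata: true,
    includeImages: false
  }
};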
Response
{
success: boolean,
data: Array<{
result: object, // Extracted content
page: {
url: string,
details: string[],
summaryLevel: string
}
}>,
errors: string[],
time_took: number
}

Examples
Basic Extraction
import { SearchClient } from 'search-agent';
const client = new SearchClient({
clientId: process.env.OBLIEN_CLIENT_ID,
clientSecret: process.env.OBLIEN_CLIENT_SECRET
});
const extracted = await client.extract([
{
url: 'https://example.com/article',
details: [
'Extract the article title',
'Extract the main content',
'Extract the author name'
]
}
]);
console.log(extracted.data[0].result);

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com/article',
details: [
'Extract the article title',
'Extract the main content',
'Extract the author name'
]
}
]
})
});
const extracted = await response.json();
console.log(extracted.data[0].result);

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://example.com/article",
"details": [
"Extract the article title",
"Extract the main content",
"Extract the author name"
]
}
]
}'

Response:
{
"success": true,
"data": [
{
"result": {
"data": "**Extract the article title:**\nGetting Started with TypeScript\n\n**Extract the main content:**\nTypeScript is a typed superset of JavaScript that compiles to plain JavaScript...\n\n**Extract the author name:**\nJohn Doe"
},
"page": {
"url": "https://example.com/article",
"details": [
"Extract the article title",
"Extract the main content",
"Extract the author name"
],
"summaryLevel": "medium"
}
}
],
"errors": [],
"time_took": 3247
}

Product Information
const extracted = await client.extract([
{
url: 'https://example.com/product/smartphone',
details: [
'Extract product name',
'Extract price',
'Extract all features',
'Extract customer rating',
'Extract availability status'
],
summaryLevel: 'intelligent'
}
], {
format: 'markdown'
});
console.log(extracted.data[0].result);

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com/product/smartphone',
details: [
'Extract product name',
'Extract price',
'Extract all features',
'Extract customer rating',
'Extract availability status'
],
summaryLevel: 'intelligent'
}
],
options: {
format: 'markdown'
}
})
});

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://example.com/product/smartphone",
"details": [
"Extract product name",
"Extract price",
"Extract all features",
"Extract customer rating",
"Extract availability status"
],
"summaryLevel": "intelligent"
}
],
"options": {
"format": "markdown"
}
}'

Batch Extraction
const extracted = await client.extract([
{
url: 'https://example.com/blog/post-1',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-2',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-3',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
}
], {
format: 'markdown',
timeout: 60000
});
extracted.data.forEach((item, index) => {
console.log(`\nPost ${index + 1}:`);
console.log(`URL: ${item.page.url}`);
console.log(`Content:`, item.result);
});

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com/blog/post-1',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-2',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-3',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
}
],
options: {
format: 'markdown',
timeout: 60000
}
})
});

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://example.com/blog/post-1",
"details": [
"Extract blog title",
"Extract publication date",
"Extract main content",
"Extract tags"
],
"summaryLevel": "medium"
},
{
"url": "https://example.com/blog/post-2",
"details": [
"Extract blog title",
"Extract publication date",
"Extract main content",
"Extract tags"
],
"summaryLevel": "medium"
},
{
"url": "https://example.com/blog/post-3",
"details": [
"Extract blog title",
"Extract publication date",
"Extract main content",
"Extract tags"
],
"summaryLevel": "medium"
}
],
"options": {
"format": "markdown",
"timeout": 60000
}
}'

Research Paper
const extracted = await client.extract([
{
url: 'https://arxiv.org/abs/example',
details: [
'Extract paper title',
'Extract authors',
'Extract abstract',
'Extract key findings',
'Extract methodology summary',
'Extract conclusions',
'Extract references'
],
summaryLevel: 'intelligent'
}
], {
format: 'markdown',
extractDepth: 'advanced',
maxContentLength: 100000
});

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://arxiv.org/abs/example',
details: [
'Extract paper title',
'Extract authors',
'Extract abstract',
'Extract key findings',
'Extract methodology summary',
'Extract conclusions',
'Extract references'
],
summaryLevel: 'intelligent'
}
],
options: {
format: 'markdown',
extractDepth: 'advanced',
maxContentLength: 100000
}
})
});

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://arxiv.org/abs/example",
"details": [
"Extract paper title",
"Extract authors",
"Extract abstract",
"Extract key findings",
"Extract methodology summary",
"Extract conclusions",
"Extract references"
],
"summaryLevel": "intelligent"
}
],
"options": {
"format": "markdown",
"extractDepth": "advanced",
"maxContentLength": 100000
}
}'

Summary Levels
| Level | Speed | Quality | Use Case |
|---|---|---|---|
| low | Fast (2-5s) | Basic | Simple extractions, large batches |
| medium | Medium (3-8s) | Balanced | Most use cases |
| intelligent | Slower (5-15s) | Highest | Complex content, detailed analysis |
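The levels can be mixed within a single request. As a rough illustration, the sketch below (assuming the client initialized in the Basic Extraction example; the URLs are placeholders) uses 'low' for bulk pages and reserves 'intelligent' for the one page that needs deeper analysis.

// Cheap, fast passes for simple pages; a higher-quality pass for the page that matters most.
const extracted = await client.extract([
  { url: 'https://example.com/listing/1', details: ['Extract the page title'], summaryLevel: 'low' },
  { url: 'https://example.com/listing/2', details: ['Extract the page title'], summaryLevel: 'low' },
  {
    url: 'https://example.com/whitepaper',
    details: ['Extract the key findings', 'Extract the conclusions'],
    summaryLevel: 'intelligent'
  }
]);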
Output Formats
| Format | Description | Use Case |
|---|---|---|
| markdown | Formatted markdown text | Documentation, articles |
| html | HTML markup | Web content, styling preserved |
| text | Plain text | Simple data, no formatting needed |
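If the extracted content feeds a downstream parser rather than a reader, plain text is usually the better choice. A minimal sketch, again assuming the client from the Basic Extraction example and a placeholder URL:

// Request plain text when the output will be post-processed rather than displayed.
const extracted = await client.extract(
  [
    {
      url: 'https://example.com/pricing',
      details: ['Extract all plan names and prices']
    }
  ],
  { format: 'text' } // 'markdown' (default), 'html', or 'text'
);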
Error Handling
try {
const extracted = await client.extract([
{
url: 'https://example.com',
details: ['Extract content']
}
]);
if (extracted.success) {
console.log(extracted.data);
} else {
console.error('Errors:', extracted.errors);
}
} catch (error) {
console.error('Extraction failed:', error.message);
}

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com',
details: ['Extract content']
}
]
})
});
const extracted = await response.json();
if (!extracted.success) {
console.error('Errors:', extracted.errors);
}

Using cURL:

# Missing required details field
curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{"url": "https://example.com"}
]
}'
# Response:
# {"success": false, "error": "Page at index 0 is missing required field: details"}

Best Practices
- Be specific with details - Clear instructions produce better results
  - Good: "Extract product price from the pricing section"
  - Bad: "Get price"
- Batch similar extractions - Process related pages together for better performance (see the sketch after this list)
- Set appropriate timeouts - Increase the timeout for large pages or slow sites
- Use intelligent summary for complex content - Research papers, technical documents
- Choose the right format - Markdown for readability, text for data processing
- Handle errors gracefully - Check both the success status and the errors array
- Include context in instructions - "Extract author name from the byline section"
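The batching, timeout, and error-handling points above can be combined in a single call. A minimal sketch, assuming the client from the Basic Extraction example and placeholder URLs:

// Group related pages into one request, allow extra time for the batch,
// and check both the success flag and the errors array.
const relatedPages = [
  'https://example.com/docs/page-1',
  'https://example.com/docs/page-2',
  'https://example.com/docs/page-3'
].map((url) => ({
  url,
  details: ['Extract the page title', 'Extract the main content'],
  summaryLevel: 'medium'
}));

const extracted = await client.extract(relatedPages, {
  format: 'markdown',
  timeout: 60000 // give the whole batch more time than a single page
});

if (!extracted.success) {
  console.error('Extraction errors:', extracted.errors);
}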
Common Use Cases
E-Commerce Data
const products = await client.extract([
{
url: 'https://store.com/product/123',
details: [
'Extract product name from title',
'Extract price from pricing section',
'Extract all specifications from specs table',
'Extract stock status',
'Extract shipping information'
]
}
]);

News Articles
const articles = await client.extract([
{
url: 'https://news.com/article',
details: [
'Extract headline',
'Extract publication date',
'Extract author',
'Extract article body',
'Extract related articles'
]
}
]);

Company Information
const company = await client.extract([
{
url: 'https://company.com/about',
details: [
'Extract company mission',
'Extract founding year',
'Extract team size',
'Extract office locations',
'Extract contact information'
]
}
]);

Rate Limits
- 50 requests per minute per client
- Maximum 20 pages per request
- Maximum 20 details per page
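To stay within the 20-pages-per-request limit when you have a longer list of URLs, split the list into chunks and send one request per chunk. A minimal sketch, assuming the client from the Basic Extraction example; extractInChunks is a hypothetical helper, and any throttling needed for the per-minute limit is left out for brevity:

// Split a long URL list into requests of at most 20 pages each.
const MAX_PAGES_PER_REQUEST = 20;

async function extractInChunks(urls, details) {
  const results = [];
  for (let i = 0; i < urls.length; i += MAX_PAGES_PER_REQUEST) {
    const pages = urls
      .slice(i, i + MAX_PAGES_PER_REQUEST)
      .map((url) => ({ url, details }));
    const extracted = await client.extract(pages);
    results.push(...extracted.data);
  }
  return results;
}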
Next Steps
- Learn about Web Search for finding information
- Explore Deep Research for comprehensive website analysis