Content Extraction
Extract specific content from web pages using natural language instructions. The Extract API uses AI to understand what you want to extract and returns structured data in your preferred format.
Overview
The extract endpoint enables you to:
- Extract specific information from web pages using natural language
- Process multiple pages in a single batch request
- Get structured output in markdown, HTML, or plain text
- Control extraction quality and depth
Endpoint
POST /search/extract

Authentication

Headers:

X-Client-ID: your_client_id
X-Client-Secret: your_client_secret
Content-Type: application/json

Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| pages | array | Yes | Array of pages to extract from |
| options | object | No | Additional extraction options |
pages Array
Each page object supports the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Page URL to extract from |
| details | string[] | Yes | Array of extraction instructions |
| summaryLevel | string | No | Quality: 'low', 'medium', 'intelligent' |
options Object
| Option | Type | Description |
|---|---|---|
| includeMetadata | boolean | Include extraction metadata |
| timeout | number | Request timeout in milliseconds (default: 30000) |
| maxContentLength | number | Maximum content length to extract |
| format | string | Output format: 'markdown', 'html', 'text' (default: 'markdown') |
| extractDepth | string | Extraction depth: 'basic', 'advanced' |
| includeImages | boolean | Include images in extracted content |
| includeFavicon | boolean | Include page favicon |
| maxLength | number | Maximum extraction length |
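Putting the tables above together, a full request body might look like the sketch below; the URL and values are placeholders, and only the options you need have to be set.

const body = {
  pages: [
    {
      url: 'https://example.com/article',                                  // required
      details: ['Extract the article title', 'Extract the author name'],   // required
      summaryLevel: 'medium'                 // optional: 'low' | 'medium' | 'intelligent'
    }
  ],
  options: {
    format: 'markdown',        // 'markdown' (default) | 'html' | 'text'
    extractDepth: 'basic',     // 'basic' | 'advanced'
    timeout: 30000,            // milliseconds (default: 30000)
    includeMetadata: true,
    includeImages: false
  }
};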
Response
{
success: boolean,
data: Array<{
result: object, // Extracted content
page: {
url: string,
details: string[],
summaryLevel: string
}
}>,
errors: string[],
time_took: number
}

Examples
Basic Extraction
import { SearchClient } from 'search-agent';
const client = new SearchClient({
clientId: process.env.OBLIEN_CLIENT_ID,
clientSecret: process.env.OBLIEN_CLIENT_SECRET
});
const extracted = await client.extract([
{
url: 'https://example.com/article',
details: [
'Extract the article title',
'Extract the main content',
'Extract the author name'
]
}
]);
console.log(extracted.data[0].result);

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com/article',
details: [
'Extract the article title',
'Extract the main content',
'Extract the author name'
]
}
]
})
});
const extracted = await response.json();
console.log(extracted.data[0].result);

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://example.com/article",
"details": [
"Extract the article title",
"Extract the main content",
"Extract the author name"
]
}
]
}'

Response:
{
"success": true,
"data": [
{
"result": {
"data": "**Extract the article title:**\nGetting Started with TypeScript\n\n**Extract the main content:**\nTypeScript is a typed superset of JavaScript that compiles to plain JavaScript...\n\n**Extract the author name:**\nJohn Doe"
},
"page": {
"url": "https://example.com/article",
"details": [
"Extract the article title",
"Extract the main content",
"Extract the author name"
],
"summaryLevel": "medium"
}
}
],
"errors": [],
"time_took": 3247
}

Product Information
const extracted = await client.extract([
{
url: 'https://example.com/product/smartphone',
details: [
'Extract product name',
'Extract price',
'Extract all features',
'Extract customer rating',
'Extract availability status'
],
summaryLevel: 'intelligent'
}
], {
format: 'markdown'
});
console.log(extracted.data[0].result);

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com/product/smartphone',
details: [
'Extract product name',
'Extract price',
'Extract all features',
'Extract customer rating',
'Extract availability status'
],
summaryLevel: 'intelligent'
}
],
options: {
format: 'markdown'
}
})
});

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://example.com/product/smartphone",
"details": [
"Extract product name",
"Extract price",
"Extract all features",
"Extract customer rating",
"Extract availability status"
],
"summaryLevel": "intelligent"
}
],
"options": {
"format": "markdown"
}
}'

Batch Extraction
const extracted = await client.extract([
{
url: 'https://example.com/blog/post-1',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-2',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-3',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
}
], {
format: 'markdown',
timeout: 60000
});
extracted.data.forEach((item, index) => {
console.log(`\nPost ${index + 1}:`);
console.log(`URL: ${item.page.url}`);
console.log(`Content:`, item.result);
});

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com/blog/post-1',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-2',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
},
{
url: 'https://example.com/blog/post-3',
details: [
'Extract blog title',
'Extract publication date',
'Extract main content',
'Extract tags'
],
summaryLevel: 'medium'
}
],
options: {
format: 'markdown',
timeout: 60000
}
})
});

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://example.com/blog/post-1",
"details": [
"Extract blog title",
"Extract publication date",
"Extract main content",
"Extract tags"
],
"summaryLevel": "medium"
},
{
"url": "https://example.com/blog/post-2",
"details": [
"Extract blog title",
"Extract publication date",
"Extract main content",
"Extract tags"
],
"summaryLevel": "medium"
},
{
"url": "https://example.com/blog/post-3",
"details": [
"Extract blog title",
"Extract publication date",
"Extract main content",
"Extract tags"
],
"summaryLevel": "medium"
}
],
"options": {
"format": "markdown",
"timeout": 60000
}
}'

Research Paper
const extracted = await client.extract([
{
url: 'https://arxiv.org/abs/example',
details: [
'Extract paper title',
'Extract authors',
'Extract abstract',
'Extract key findings',
'Extract methodology summary',
'Extract conclusions',
'Extract references'
],
summaryLevel: 'intelligent'
}
], {
format: 'markdown',
extractDepth: 'advanced',
maxContentLength: 100000
});

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://arxiv.org/abs/example',
details: [
'Extract paper title',
'Extract authors',
'Extract abstract',
'Extract key findings',
'Extract methodology summary',
'Extract conclusions',
'Extract references'
],
summaryLevel: 'intelligent'
}
],
options: {
format: 'markdown',
extractDepth: 'advanced',
maxContentLength: 100000
}
})
});

Using cURL:

curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{
"url": "https://arxiv.org/abs/example",
"details": [
"Extract paper title",
"Extract authors",
"Extract abstract",
"Extract key findings",
"Extract methodology summary",
"Extract conclusions",
"Extract references"
],
"summaryLevel": "intelligent"
}
],
"options": {
"format": "markdown",
"extractDepth": "advanced",
"maxContentLength": 100000
}
}'

Summary Levels
| Level | Speed | Quality | Use Case |
|---|---|---|---|
| low | Fast (2-5s) | Basic | Simple extractions, large batches |
| medium | Medium (3-8s) | Balanced | Most use cases |
| intelligent | Slower (5-15s) | Highest | Complex content, detailed analysis |
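The levels can be mixed within a single request. As a rough illustration, the sketch below (assuming the client initialized in the Basic Extraction example; the URLs are placeholders) uses 'low' for bulk pages and reserves 'intelligent' for the one page that needs deeper analysis.

// Cheap, fast passes for simple pages; a higher-quality pass for the page that matters most.
const extracted = await client.extract([
  { url: 'https://example.com/listing/1', details: ['Extract the page title'], summaryLevel: 'low' },
  { url: 'https://example.com/listing/2', details: ['Extract the page title'], summaryLevel: 'low' },
  {
    url: 'https://example.com/whitepaper',
    details: ['Extract the key findings', 'Extract the conclusions'],
    summaryLevel: 'intelligent'
  }
]);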
Output Formats
| Format | Description | Use Case |
|---|---|---|
| markdown | Formatted markdown text | Documentation, articles |
| html | HTML markup | Web content, styling preserved |
| text | Plain text | Simple data, no formatting needed |
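If the extracted content feeds a downstream parser rather than a reader, plain text is usually the better choice. A minimal sketch, again assuming the client from the Basic Extraction example and a placeholder URL:

// Request plain text when the output will be post-processed rather than displayed.
const extracted = await client.extract(
  [
    {
      url: 'https://example.com/pricing',
      details: ['Extract all plan names and prices']
    }
  ],
  { format: 'text' } // 'markdown' (default), 'html', or 'text'
);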
Error Handling
try {
const extracted = await client.extract([
{
url: 'https://example.com',
details: ['Extract content']
}
]);
if (extracted.success) {
console.log(extracted.data);
} else {
console.error('Errors:', extracted.errors);
}
} catch (error) {
console.error('Extraction failed:', error.message);
}

Using fetch:

const response = await fetch('https://api.oblien.com/search/extract', {
method: 'POST',
headers: {
'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
'Content-Type': 'application/json'
},
body: JSON.stringify({
pages: [
{
url: 'https://example.com',
details: ['Extract content']
}
]
})
});
const extracted = await response.json();
if (!extracted.success) {
console.error('Errors:', extracted.errors);
}

Using cURL:

# Missing required details field
curl -X POST https://api.oblien.com/search/extract \
-H "X-Client-ID: $OBLIEN_CLIENT_ID" \
-H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{
"pages": [
{"url": "https://example.com"}
]
}'
# Response:
# {"success": false, "error": "Page at index 0 is missing required field: details"}

Best Practices
- Be specific with details - Clear instructions produce better results
  - Good: "Extract product price from the pricing section"
  - Bad: "Get price"
- Batch similar extractions - Process related pages together for better performance (see the sketch after this list)
- Set appropriate timeouts - Increase the timeout for large pages or slow sites
- Use intelligent summary for complex content - Research papers, technical documents
- Choose the right format - Markdown for readability, text for data processing
- Handle errors gracefully - Check both the success status and the errors array
- Include context in instructions - "Extract author name from the byline section"
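The batching, timeout, and error-handling points above can be combined in a single call. A minimal sketch, assuming the client from the Basic Extraction example and placeholder URLs:

// Group related pages into one request, allow extra time for the batch,
// and check both the success flag and the errors array.
const relatedPages = [
  'https://example.com/docs/page-1',
  'https://example.com/docs/page-2',
  'https://example.com/docs/page-3'
].map((url) => ({
  url,
  details: ['Extract the page title', 'Extract the main content'],
  summaryLevel: 'medium'
}));

const extracted = await client.extract(relatedPages, {
  format: 'markdown',
  timeout: 60000 // give the whole batch more time than a single page
});

if (!extracted.success) {
  console.error('Extraction errors:', extracted.errors);
}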
Common Use Cases
E-Commerce Data
const products = await client.extract([
{
url: 'https://store.com/product/123',
details: [
'Extract product name from title',
'Extract price from pricing section',
'Extract all specifications from specs table',
'Extract stock status',
'Extract shipping information'
]
}
]);

News Articles
const articles = await client.extract([
{
url: 'https://news.com/article',
details: [
'Extract headline',
'Extract publication date',
'Extract author',
'Extract article body',
'Extract related articles'
]
}
]);

Company Information
const company = await client.extract([
{
url: 'https://company.com/about',
details: [
'Extract company mission',
'Extract founding year',
'Extract team size',
'Extract office locations',
'Extract contact information'
]
}
]);

Rate Limits
- 50 requests per minute per client
- Maximum 20 pages per request
- Maximum 20 details per page
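To stay within the 20-pages-per-request limit when you have a longer list of URLs, split the list into chunks and send one request per chunk. A minimal sketch, assuming the client from the Basic Extraction example; extractInChunks is a hypothetical helper, and any throttling needed for the per-minute limit is left out for brevity:

// Split a long URL list into requests of at most 20 pages each.
const MAX_PAGES_PER_REQUEST = 20;

async function extractInChunks(urls, details) {
  const results = [];
  for (let i = 0; i < urls.length; i += MAX_PAGES_PER_REQUEST) {
    const pages = urls
      .slice(i, i + MAX_PAGES_PER_REQUEST)
      .map((url) => ({ url, details }));
    const extracted = await client.extract(pages);
    results.push(...extracted.data);
  }
  return results;
}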
Next Steps
- Learn about Web Search for finding information
- Explore Deep Research for comprehensive website analysis