Ctrl + K

Content Extraction

Extract specific content from web pages using natural language instructions. The Extract API uses AI to understand what you want to extract and returns structured data in your preferred format.

Overview

The extract endpoint enables you to:

  • Extract specific information from web pages using natural language
  • Process multiple pages in a single batch request
  • Get structured output in markdown, HTML, or plain text
  • Control extraction quality and depth

Endpoint

POST /search/extract

Authentication

Headers:
  X-Client-ID: your_client_id
  X-Client-Secret: your_client_secret
  Content-Type: application/json

Request Body

FieldTypeRequiredDescription
pagesarrayYesArray of pages to extract from
optionsobjectNoAdditional extraction options

pages Array

Each page object must include:

FieldTypeRequiredDescription
urlstringYesPage URL to extract from
detailsstring[]YesArray of extraction instructions
summaryLevelstringNoQuality: 'low', 'medium', 'intelligent'

options Object

OptionTypeDescription
includeMetadatabooleanInclude extraction metadata
timeoutnumberRequest timeout in milliseconds (default: 30000)
maxContentLengthnumberMaximum content length to extract
formatstringOutput format: 'markdown', 'html', 'text' (default: 'markdown')
extractDepthstringExtraction depth: 'basic', 'advanced'
includeImagesbooleanInclude images in extracted content
includeFaviconbooleanInclude page favicon
maxLengthnumberMaximum extraction length

Response

{
  success: boolean,
  data: Array<{
    result: object,  // Extracted content
    page: {
      url: string,
      details: string[],
      summaryLevel: string
    }
  }>,
  errors: string[],
  time_took: number
}

Examples

Basic Extraction

import { SearchClient } from 'search-agent';

const client = new SearchClient({
  clientId: process.env.OBLIEN_CLIENT_ID,
  clientSecret: process.env.OBLIEN_CLIENT_SECRET
});

const extracted = await client.extract([
  {
    url: 'https://example.com/article',
    details: [
      'Extract the article title',
      'Extract the main content',
      'Extract the author name'
    ]
  }
]);

console.log(extracted.data[0].result);
const response = await fetch('https://api.oblien.com/search/extract', {
  method: 'POST',
  headers: {
    'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
    'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    pages: [
      {
        url: 'https://example.com/article',
        details: [
          'Extract the article title',
          'Extract the main content',
          'Extract the author name'
        ]
      }
    ]
  })
});

const extracted = await response.json();
console.log(extracted.data[0].result);
curl -X POST https://api.oblien.com/search/extract \
  -H "X-Client-ID: $OBLIEN_CLIENT_ID" \
  -H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "pages": [
      {
        "url": "https://example.com/article",
        "details": [
          "Extract the article title",
          "Extract the main content",
          "Extract the author name"
        ]
      }
    ]
  }'

Response:

{
  "success": true,
  "data": [
    {
      "result": {
        "data": "**Extract the article title:**\nGetting Started with TypeScript\n\n**Extract the main content:**\nTypeScript is a typed superset of JavaScript that compiles to plain JavaScript...\n\n**Extract the author name:**\nJohn Doe"
      },
      "page": {
        "url": "https://example.com/article",
        "details": [
          "Extract the article title",
          "Extract the main content",
          "Extract the author name"
        ],
        "summaryLevel": "medium"
      }
    }
  ],
  "errors": [],
  "time_took": 3247
}

Product Information

const extracted = await client.extract([
  {
    url: 'https://example.com/product/smartphone',
    details: [
      'Extract product name',
      'Extract price',
      'Extract all features',
      'Extract customer rating',
      'Extract availability status'
    ],
    summaryLevel: 'intelligent'
  }
], {
  format: 'markdown'
});

console.log(extracted.data[0].result);
const response = await fetch('https://api.oblien.com/search/extract', {
  method: 'POST',
  headers: {
    'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
    'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    pages: [
      {
        url: 'https://example.com/product/smartphone',
        details: [
          'Extract product name',
          'Extract price',
          'Extract all features',
          'Extract customer rating',
          'Extract availability status'
        ],
        summaryLevel: 'intelligent'
      }
    ],
    options: {
      format: 'markdown'
    }
  })
});
curl -X POST https://api.oblien.com/search/extract \
  -H "X-Client-ID: $OBLIEN_CLIENT_ID" \
  -H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "pages": [
      {
        "url": "https://example.com/product/smartphone",
        "details": [
          "Extract product name",
          "Extract price",
          "Extract all features",
          "Extract customer rating",
          "Extract availability status"
        ],
        "summaryLevel": "intelligent"
      }
    ],
    "options": {
      "format": "markdown"
    }
  }'

Batch Extraction

const extracted = await client.extract([
  {
    url: 'https://example.com/blog/post-1',
    details: [
      'Extract blog title',
      'Extract publication date',
      'Extract main content',
      'Extract tags'
    ],
    summaryLevel: 'medium'
  },
  {
    url: 'https://example.com/blog/post-2',
    details: [
      'Extract blog title',
      'Extract publication date',
      'Extract main content',
      'Extract tags'
    ],
    summaryLevel: 'medium'
  },
  {
    url: 'https://example.com/blog/post-3',
    details: [
      'Extract blog title',
      'Extract publication date',
      'Extract main content',
      'Extract tags'
    ],
    summaryLevel: 'medium'
  }
], {
  format: 'markdown',
  timeout: 60000
});

extracted.data.forEach((item, index) => {
  console.log(`\nPost ${index + 1}:`);
  console.log(`URL: ${item.page.url}`);
  console.log(`Content:`, item.result);
});
const response = await fetch('https://api.oblien.com/search/extract', {
  method: 'POST',
  headers: {
    'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
    'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    pages: [
      {
        url: 'https://example.com/blog/post-1',
        details: [
          'Extract blog title',
          'Extract publication date',
          'Extract main content',
          'Extract tags'
        ],
        summaryLevel: 'medium'
      },
      {
        url: 'https://example.com/blog/post-2',
        details: [
          'Extract blog title',
          'Extract publication date',
          'Extract main content',
          'Extract tags'
        ],
        summaryLevel: 'medium'
      },
      {
        url: 'https://example.com/blog/post-3',
        details: [
          'Extract blog title',
          'Extract publication date',
          'Extract main content',
          'Extract tags'
        ],
        summaryLevel: 'medium'
      }
    ],
    options: {
      format: 'markdown',
      timeout: 60000
    }
  })
});
curl -X POST https://api.oblien.com/search/extract \
  -H "X-Client-ID: $OBLIEN_CLIENT_ID" \
  -H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "pages": [
      {
        "url": "https://example.com/blog/post-1",
        "details": [
          "Extract blog title",
          "Extract publication date",
          "Extract main content",
          "Extract tags"
        ],
        "summaryLevel": "medium"
      },
      {
        "url": "https://example.com/blog/post-2",
        "details": [
          "Extract blog title",
          "Extract publication date",
          "Extract main content",
          "Extract tags"
        ],
        "summaryLevel": "medium"
      },
      {
        "url": "https://example.com/blog/post-3",
        "details": [
          "Extract blog title",
          "Extract publication date",
          "Extract main content",
          "Extract tags"
        ],
        "summaryLevel": "medium"
      }
    ],
    "options": {
      "format": "markdown",
      "timeout": 60000
    }
  }'

Research Paper

const extracted = await client.extract([
  {
    url: 'https://arxiv.org/abs/example',
    details: [
      'Extract paper title',
      'Extract authors',
      'Extract abstract',
      'Extract key findings',
      'Extract methodology summary',
      'Extract conclusions',
      'Extract references'
    ],
    summaryLevel: 'intelligent'
  }
], {
  format: 'markdown',
  extractDepth: 'advanced',
  maxContentLength: 100000
});
const response = await fetch('https://api.oblien.com/search/extract', {
  method: 'POST',
  headers: {
    'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
    'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    pages: [
      {
        url: 'https://arxiv.org/abs/example',
        details: [
          'Extract paper title',
          'Extract authors',
          'Extract abstract',
          'Extract key findings',
          'Extract methodology summary',
          'Extract conclusions',
          'Extract references'
        ],
        summaryLevel: 'intelligent'
      }
    ],
    options: {
      format: 'markdown',
      extractDepth: 'advanced',
      maxContentLength: 100000
    }
  })
});
curl -X POST https://api.oblien.com/search/extract \
  -H "X-Client-ID: $OBLIEN_CLIENT_ID" \
  -H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "pages": [
      {
        "url": "https://arxiv.org/abs/example",
        "details": [
          "Extract paper title",
          "Extract authors",
          "Extract abstract",
          "Extract key findings",
          "Extract methodology summary",
          "Extract conclusions",
          "Extract references"
        ],
        "summaryLevel": "intelligent"
      }
    ],
    "options": {
      "format": "markdown",
      "extractDepth": "advanced",
      "maxContentLength": 100000
    }
  }'

Summary Levels

LevelSpeedQualityUse Case
lowFast (2-5s)BasicSimple extractions, large batches
mediumMedium (3-8s)BalancedMost use cases
intelligentSlower (5-15s)HighestComplex content, detailed analysis

Output Formats

FormatDescriptionUse Case
markdownFormatted markdown textDocumentation, articles
htmlHTML markupWeb content, styling preserved
textPlain textSimple data, no formatting needed

Error Handling

try {
  const extracted = await client.extract([
    {
      url: 'https://example.com',
      details: ['Extract content']
    }
  ]);
  
  if (extracted.success) {
    console.log(extracted.data);
  } else {
    console.error('Errors:', extracted.errors);
  }
} catch (error) {
  console.error('Extraction failed:', error.message);
}
const response = await fetch('https://api.oblien.com/search/extract', {
  method: 'POST',
  headers: {
    'X-Client-ID': process.env.OBLIEN_CLIENT_ID,
    'X-Client-Secret': process.env.OBLIEN_CLIENT_SECRET,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    pages: [
      {
        url: 'https://example.com',
        details: ['Extract content']
      }
    ]
  })
});

const extracted = await response.json();
if (!extracted.success) {
  console.error('Errors:', extracted.errors);
}
# Missing required details field
curl -X POST https://api.oblien.com/search/extract \
  -H "X-Client-ID: $OBLIEN_CLIENT_ID" \
  -H "X-Client-Secret: $OBLIEN_CLIENT_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "pages": [
      {"url": "https://example.com"}
    ]
  }'

# Response:
# {"success": false, "error": "Page at index 0 is missing required field: details"}

Best Practices

  1. Be specific with details - Clear instructions produce better results

    • Good: "Extract product price from the pricing section"
    • Bad: "Get price"
  2. Batch similar extractions - Process related pages together for better performance

  3. Set appropriate timeouts - Increase timeout for large pages or slow sites

  4. Use intelligent summary for complex content - Research papers, technical documents

  5. Choose the right format - Markdown for readability, text for data processing

  6. Handle errors gracefully - Check both success status and errors array

  7. Include context in instructions - "Extract author name from the byline section"

Common Use Cases

E-Commerce Data

const products = await client.extract([
  {
    url: 'https://store.com/product/123',
    details: [
      'Extract product name from title',
      'Extract price from pricing section',
      'Extract all specifications from specs table',
      'Extract stock status',
      'Extract shipping information'
    ]
  }
]);

News Articles

const articles = await client.extract([
  {
    url: 'https://news.com/article',
    details: [
      'Extract headline',
      'Extract publication date',
      'Extract author',
      'Extract article body',
      'Extract related articles'
    ]
  }
]);

Company Information

const company = await client.extract([
  {
    url: 'https://company.com/about',
    details: [
      'Extract company mission',
      'Extract founding year',
      'Extract team size',
      'Extract office locations',
      'Extract contact information'
    ]
  }
]);

Rate Limits

  • 50 requests per minute per client
  • Maximum 20 pages per request
  • Maximum 20 details per page

Next Steps