Introducing Schema Source: Programmatic Blueprinting for Web Data Extraction

Building schemas for AI web scraping is tedious. We built Schema Source, an independent API that instantly reverse-engineers any webpage to generate a production-ready JSON schema. Learn how to combine it with Tabstack to automate programmatic blueprinting, extraction, and data transformations with zero manual config.

We built Tabstack to replace traditional parsing configurations with a declarative model. Instead of telling a scraper how to navigate a page, you define what data you want using a JSON schema, and our /extract/json and /generate/json endpoints handle the execution.

Schema-first extraction keeps working when a page's markup or layout changes, but it introduces a predictable bottleneck: manual schema design.

Manually building a valid, production-grade JSON schema for a complex web page is tedious. It requires inspecting the DOM, identifying data hierarchies, mapping out types, and writing detailed string descriptions so the underlying LLMs understand the context of each field.

To automate this setup phase, we built Schema Source.

What is Schema Source?

Schema Source is an independent API that reverse-engineers a webpage's visual and semantic layout to generate a structured data schema. Send a plain GET request with the target URL, and it returns a JSON schema blueprint for that page.

For example, if you point it at a content-heavy community page like https://reddit.com/r/nba, Schema Source analyzes the page components and automatically outputs an optimized JSON schema mapping the posts, metrics, and metadata.

You can then pass this schema directly into Tabstack's extraction pipelines to begin pulling live data immediately.

How Schema Source Fits into the Tabstack Workflow

In a schema-first architecture, field descriptions act as system prompts. If your schema has a field called score, the LLM needs to know whether that means a sporting event's box score, a user upvote count, or a sentiment rating.

Schema Source handles this contextual work by generating explicit metadata alongside the data types.

Step 1: Bootstrapping the Schema via Schema Source

To generate a structural baseline for a target page, append your target URL directly to the Schema Source endpoint:

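// Request a generated JSON schema for the given page from Schema Source.
async function generateSchema(url) {
  const res = await fetch(
    `https://schema.tabstack.ai/get/${encodeURIComponent(url)}?format=json`,
  );

  // Fail loudly on HTTP errors instead of trying to parse an error body.
  if (!res.ok) {
    throw new Error(`Schema Source request failed with status ${res.status}`);
  }

  // The response body wraps the schema in a `schema` property.
  const { schema } = await res.json();

  return schema;
}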

The API analyzes the target page and returns a clean, production-ready JSON schema tailored to its core data elements:

{
   "type":"object",
   "properties":{
      "posts":{
         "type":"array",
         "description":"List of popular posts on Reddit's front page",
         "items":{
            "type":"object",
            "properties":{
               "title":{
                  "type":"string",
                  "description":"The title of the Reddit post"
               },
               "author":{
                  "type":"string",
                  "description":"The username of the post's author"
               },
               "score":{
                  "type":"number",
                  "description":"The score (upvotes) of the post"
               },
               "comment_count":{
                  "type":"number",
                  "description":"The number of comments on the post"
               },
               "permalink":{
                  "type":"string",
                  "description":"The direct link to the Reddit post discussion page"
               },
               "published_at":{
                  "type":"string",
                  "description":"The timestamp when the post was published"
               }
            },
            "required":[
               "title",
               "author",
               "score",
               "comment_count",
               "permalink",
               "published_at"
            ]
         },
         "maxItems":10
      }
   },
   "required":[
      "posts"
   ]
}
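Because the descriptions do the prompting, it can be worth confirming that every named field still has one after a generated schema is edited by hand. The following is a minimal sketch of such a check; `findUndescribedFields` and the sample schema are hypothetical and are not part of Schema Source or the Tabstack SDK. It only walks `properties` and `items`, so it is not a full JSON Schema traversal.

```javascript
// Hypothetical helper (not part of Schema Source or the Tabstack SDK):
// returns the paths of named fields that have no non-empty description.
function findUndescribedFields(schema, path = '') {
  const missing = [];

  if (schema.type === 'object' && schema.properties) {
    for (const [key, child] of Object.entries(schema.properties)) {
      const childPath = path ? `${path}.${key}` : key;
      const hasDescription =
        typeof child.description === 'string' && child.description.trim() !== '';
      if (!hasDescription) missing.push(childPath);
      // Recurse so nested objects and arrays are checked too.
      missing.push(...findUndescribedFields(child, childPath));
    }
  } else if (schema.type === 'array' && schema.items) {
    // Array items have no name of their own, so only their children are checked.
    missing.push(...findUndescribedFields(schema.items, `${path}[]`));
  }

  return missing;
}

// Example: `score` was hand-edited and lost its description.
const editedSchema = {
  type: 'object',
  properties: {
    posts: {
      type: 'array',
      description: 'List of posts',
      items: {
        type: 'object',
        properties: {
          title: { type: 'string', description: 'The title of the post' },
          score: { type: 'number' },
        },
      },
    },
  },
};

console.log(findUndescribedFields(editedSchema)); // logs [ 'posts[].score' ]
```

Run against the generated schema shown above, this check would return an empty array, since every named field carries a description.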

Step 2: Running a Direct Extraction with Tabstack

Deploying the schema into a production pipeline requires no manual layout configuration. Pass the generated schema directly into the official @tabstack/sdk to run a literal extraction, one that returns the values as they appear on the page.

import Tabstack from '@tabstack/sdk';

const client = new Tabstack({ apiKey: process.env.TABSTACK_API_KEY });

// generateSchema() is the helper defined in Step 1.
const url = 'https://reddit.com/r/nba';
const json_schema = await generateSchema(url);

// Extract the fields described by the schema as they appear on the page.
async function extractLiteralData() {
  const response = await client.extract.json({
    url,
    json_schema,
  });

  console.log('Extracted Data:', JSON.stringify(response.data, null, 2));
}

await extractLiteralData();
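Before handing extracted records to downstream code, a quick sanity check on the required fields can catch incomplete items early. The sketch below assumes `response.data` holds an object shaped like the schema, as the logging line above suggests. `findMissingRequired` and the sample data are hypothetical, and the check is deliberately much narrower than full JSON Schema validation.

```javascript
// Hypothetical helper: lists required fields that are absent or null
// in each item of one top-level array described by the schema.
function findMissingRequired(data, schema, arrayKey) {
  const required = schema.properties?.[arrayKey]?.items?.required ?? [];
  const items = Array.isArray(data?.[arrayKey]) ? data[arrayKey] : [];

  return items.flatMap((item, index) =>
    required
      .filter((field) => item?.[field] === undefined || item?.[field] === null)
      .map((field) => `${arrayKey}[${index}].${field}`),
  );
}

// Example with made-up data; the second post has no score.
const exampleSchema = {
  type: 'object',
  properties: {
    posts: {
      type: 'array',
      items: { type: 'object', required: ['title', 'score'] },
    },
  },
};
const exampleData = {
  posts: [{ title: 'Game thread', score: 120 }, { title: 'Trade rumor' }],
};

console.log(findMissingRequired(exampleData, exampleSchema, 'posts'));
// logs [ 'posts[1].score' ]
```

In a real pipeline the call would look like `findMissingRequired(response.data, json_schema, 'posts')`, with an empty array meaning every item carried its required fields.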

Step 3: Layering AI Transformation with generate.json

If your pipeline needs to transform, filter, or analyze data rather than mirror what is literally on the page, you can use the same schema with Tabstack's /generate/json endpoint.

While /extract/json returns the data as it appears on the page, /generate/json also accepts custom prompt instructions (for example, summarization or synthesis) and still returns the modified results in the structure your schema defines:

// Reuses client, url, and json_schema from Step 2.
async function extractAndTransformData() {
  const response = await client.generate.json({
    url,
    json_schema,
    instructions: "Filter out any threads that aren't about live game updates. For the remaining posts, rewrite the title to be a strictly objective 5-word summary.",
  });

  console.log('Transformed Data:', JSON.stringify(response.data, null, 2));
}

await extractAndTransformData();

Decoupling Configuration from Execution

We kept Schema Source separate from Tabstack so that discovering a page's structure stays independent of the production pipelines that ingest its data.

This architecture eliminates the need to manually draft schemas or guess layout hierarchies during the initial development cycle. Teams can use Schema Source programmatically to seed baseline data definitions, then rely entirely on Tabstack’s infrastructure to fetch, parse, and validate those payloads at scale.
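One way to act on this split is to generate each schema once during setup and reuse it across extraction runs. The sketch below memoizes an async schema generator per URL in memory. `createSchemaCache` is a hypothetical helper, and the stub generator stands in for a function like the `generateSchema` helper from Step 1.

```javascript
// Hypothetical helper: wraps any async schema generator so that each URL
// triggers at most one request while its result is cached.
function createSchemaCache(generate) {
  const cache = new Map(); // url -> Promise resolving to the schema

  return function getSchema(url) {
    if (!cache.has(url)) {
      const pending = Promise.resolve(generate(url)).catch((err) => {
        cache.delete(url); // drop failures so a later call can retry
        throw err;
      });
      cache.set(url, pending);
    }
    return cache.get(url);
  };
}

// Usage with a stub generator; in practice pass generateSchema from Step 1.
let calls = 0;
const stubGenerate = async (url) => {
  calls += 1;
  return { type: 'object', properties: {} };
};

const getSchema = createSchemaCache(stubGenerate);
const first = getSchema('https://reddit.com/r/nba');
const second = getSchema('https://reddit.com/r/nba');

console.log(first === second, calls); // logs: true 1
```

Caching the promise rather than the resolved value means concurrent callers share a single in-flight request. A longer-lived pipeline would more likely persist the schema to a file or database, which this sketch does not cover.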

Try generating your first schema at schema.tabstack.ai.
