Implementing Scraping Results Storage in Database
Scraping results must not just be saved — they must be saved so you can track change history, make quick queries, and not lose data on re-crawl.
Table Schema
Two levels: raw data and normalized.
-- Raw HTML or JSON response (for debugging and re-parsing)
CREATE TABLE scrape_raw (
id BIGSERIAL PRIMARY KEY,
site_id INTEGER NOT NULL,
url TEXT NOT NULL,
body TEXT,
status_code SMALLINT,
scraped_at TIMESTAMP DEFAULT NOW(),
CONSTRAINT uq_scrape_raw UNIQUE (site_id, url, DATE(scraped_at))
);
-- Normalized products
CREATE TABLE scraped_products (
id BIGSERIAL PRIMARY KEY,
site_id INTEGER NOT NULL,
external_id VARCHAR(255),
url TEXT NOT NULL,
name TEXT,
price NUMERIC(12,2),
currency CHAR(3),
in_stock BOOLEAN,
data JSONB, -- everything else not in columns
scraped_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
CONSTRAINT uq_scraped_product UNIQUE (site_id, external_id)
);
CREATE INDEX idx_scraped_products_site ON scraped_products (site_id);
CREATE INDEX idx_scraped_products_data ON scraped_products USING gin(data);
JSONB column data stores everything not standardized: characteristics, images, attributes. GIN index enables fast content queries.
Upsert on Re-Scrape
def save_product(conn, site_id: int, product: dict):
conn.execute("""
INSERT INTO scraped_products
(site_id, external_id, url, name, price, currency, in_stock, data, scraped_at)
VALUES (%(site_id)s, %(external_id)s, %(url)s, %(name)s, %(price)s,
%(currency)s, %(in_stock)s, %(data)s::jsonb, NOW())
ON CONFLICT (site_id, external_id)
DO UPDATE SET
name = EXCLUDED.name,
price = EXCLUDED.price,
in_stock = EXCLUDED.in_stock,
data = EXCLUDED.data,
updated_at = NOW(),
scraped_at = NOW()
""", {**product, 'site_id': site_id, 'data': json.dumps(product.get('extra', {}))})
Implementation Timeline
Basic schema with upsert logic and indexes — 1–2 business days.







