#usda#postgres#bulk#fooddata-central#self-hosting

USDA FDC bulk CSV → Postgres for offline lookups

When the API isn't enough: load the entire FDC into Postgres in 20 minutes.

Why bulk

The USDA FDC API gives you 1,000 requests per hour per IP, free. Generous for one user. Not generous for:

  • A self-hosted tracker that does search-as-you-type.
  • A small homelab serving multiple users.
  • Anything offline.
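The budget math is easy to check. A back-of-envelope sketch (the keystroke and user counts are illustrative assumptions, not measurements):

```python
# Back-of-envelope: how long 1,000 requests/hour survives search-as-you-type.
HOURLY_BUDGET = 1_000

def minutes_until_exhausted(users: int, queries_per_min_per_user: int) -> float:
    """Minutes of sustained typing before the hourly budget is gone."""
    total_per_min = users * queries_per_min_per_user
    return HOURLY_BUDGET / total_per_min

# One user firing a query per keystroke at ~60 queries/min:
solo = minutes_until_exhausted(users=1, queries_per_min_per_user=60)

# A household of four on the same IP:
household = minutes_until_exhausted(users=4, queries_per_min_per_user=60)

print(f"solo: {solo:.1f} min, household: {household:.1f} min")
# → solo: 16.7 min, household: 4.2 min
```

At one query per keystroke, a single typist burns the hour's budget in under 17 minutes. The bulk load removes the ceiling entirely.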

The FDC bulk download gives you the entire database, refreshed monthly, free, no rate limit. About 6 GiB compressed, ~14 GiB uncompressed. Loads into Postgres in about 20 minutes on modest hardware.

What you download

From fdc.nal.usda.gov/download-datasets.html:

  • FoodData_Central_csv_2025-10-31.zip (or whatever the current dump is). Contains:
    • food.csv — base food records with FDC IDs, descriptions, data types
    • nutrient.csv — the master list of nutrients
    • food_nutrient.csv — the (food, nutrient, amount) tuples (the big one)
    • branded_food.csv — extra fields for branded products
    • Several smaller files for Foundation, Survey, etc.

The bulk dump is updated roughly monthly. You don’t need to refresh it daily — for our use case quarterly is fine.

Schema

A practical schema that maps the CSVs into Postgres:

CREATE TABLE food (
    fdc_id INTEGER PRIMARY KEY,
    data_type TEXT NOT NULL,
    description TEXT NOT NULL,
    food_category_id INTEGER,
    publication_date DATE
);

CREATE TABLE nutrient (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    unit_name TEXT NOT NULL,
    rank INTEGER
);

CREATE TABLE food_nutrient (
    id INTEGER PRIMARY KEY,
    fdc_id INTEGER NOT NULL,
    nutrient_id INTEGER NOT NULL,
    amount NUMERIC,
    FOREIGN KEY (fdc_id) REFERENCES food(fdc_id),
    FOREIGN KEY (nutrient_id) REFERENCES nutrient(id)
);

CREATE TABLE branded_food (
    fdc_id INTEGER PRIMARY KEY REFERENCES food(fdc_id),
    brand_owner TEXT,
    gtin_upc TEXT,
    serving_size NUMERIC,
    serving_size_unit TEXT,
    branded_food_category TEXT
);

That’s the minimum useful surface. The full schema has 40+ tables; we don’t need most of them.

Loading

Unzip the dump:

unzip FoodData_Central_csv_2025-10-31.zip -d /tmp/fdc/

COPY from CSV (run from psql):

\copy food                FROM '/tmp/fdc/food.csv'           WITH CSV HEADER
\copy nutrient            FROM '/tmp/fdc/nutrient.csv'       WITH CSV HEADER
\copy food_nutrient       FROM '/tmp/fdc/food_nutrient.csv'  WITH CSV HEADER
\copy branded_food        FROM '/tmp/fdc/branded_food.csv'   WITH CSV HEADER

One caveat: \copy expects the CSV's columns to line up with the target table. If the current dump's nutrient.csv, food_nutrient.csv, or branded_food.csv carries columns beyond the minimal schema above (recent dumps do), trim the files first or load into wider staging tables and INSERT ... SELECT just the columns you keep.

Sizes: food ~1.5M rows, nutrient ~700, food_nutrient ~28M, branded_food ~1.3M. The last two are the heavy ones.

Time: about 18 minutes total on a Pi 5 with NVMe storage. Closer to 8 minutes on a modest VPS.
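If a CSV has more columns than the table it feeds, one option is to project just the columns you keep before loading. A sketch that selects by header name rather than position, so it survives column reordering (the column names listed are assumptions to verify against the current dump's header row):

```python
import csv

def project_columns(src_path: str, dst_path: str, keep: list[str]) -> None:
    """Copy a CSV, keeping only the named columns, in the given order."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(keep)
        for row in reader:
            writer.writerow([row[col] for col in keep])

# Usage, with the paths from the unzip step above:
# project_columns("/tmp/fdc/food_nutrient.csv",
#                 "/tmp/fdc/food_nutrient_slim.csv",
#                 keep=["id", "fdc_id", "nutrient_id", "amount"])
```

Then point the \copy at the slim file. For a 28M-row file this is one streaming pass; it never holds the CSV in memory.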

Indexes

The defaults aren’t enough. Add at minimum:

CREATE INDEX idx_food_description_trgm ON food USING gin (description gin_trgm_ops);
CREATE INDEX idx_food_data_type ON food (data_type);
CREATE INDEX idx_food_nutrient_fdc ON food_nutrient (fdc_id);
CREATE INDEX idx_food_nutrient_nut ON food_nutrient (nutrient_id);
CREATE INDEX idx_branded_gtin ON branded_food (gtin_upc);

You’ll need pg_trgm (CREATE EXTENSION IF NOT EXISTS pg_trgm;) before the trigram index will build. It makes search-as-you-type viable on the description column.

Index build time: about 4 minutes.
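To build intuition for what the trigram index matches: pg_trgm lowercases the string, pads it with spaces, extracts three-character windows, and scores similarity as the shared fraction of distinct trigrams. A rough re-implementation (simplified: real pg_trgm also strips non-alphanumerics and pads each word separately):

```python
def trigrams(s: str) -> set[str]:
    """Distinct 3-grams of a lowercased, space-padded string (simplified pg_trgm)."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared fraction of distinct trigrams, roughly what pg_trgm's similarity() returns."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

This is why a typo like "roled oats" still finds "rolled oats": most trigrams survive. In Postgres, the % operator matches when similarity exceeds a threshold (default 0.3, tunable via SET pg_trgm.similarity_threshold).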

Sample queries

Barcode lookup

SELECT
    f.fdc_id,
    f.description,
    bf.brand_owner,
    bf.serving_size,
    bf.serving_size_unit,
    (SELECT amount FROM food_nutrient fn 
     JOIN nutrient n ON fn.nutrient_id = n.id 
     WHERE fn.fdc_id = f.fdc_id 
       AND n.name = 'Energy' 
       AND n.unit_name = 'kcal') AS kcal_100g
FROM food f
JOIN branded_food bf ON f.fdc_id = bf.fdc_id
WHERE bf.gtin_upc = '038000138416';

p99 latency on the Pi 5: about 4 ms.

Description search

SELECT fdc_id, description, data_type
FROM food
WHERE description % 'rolled oats'
  AND data_type IN ('Foundation', 'SR Legacy')
ORDER BY similarity(description, 'rolled oats') DESC
LIMIT 10;

The % operator uses the trigram index. p99 around 12 ms on our setup.

Macro panel for a food

SELECT n.name, n.unit_name, fn.amount
FROM food_nutrient fn
JOIN nutrient n ON fn.nutrient_id = n.id
WHERE fn.fdc_id = 173944
  AND n.name IN ('Energy', 'Protein', 'Carbohydrate, by difference', 'Total lipid (fat)')
ORDER BY n.rank;

p99 around 1 ms.
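FDC nutrient amounts are generally per 100 g of food, so a tracker scales by the logged portion. The arithmetic, as a sketch (the 389 kcal figure is illustrative, not a value from the database):

```python
def portion_amount(per_100g: float, grams: float) -> float:
    """Scale a per-100 g nutrient amount to an actual portion."""
    return per_100g * grams / 100.0

# 40 g of rolled oats at 389 kcal per 100 g:
kcal = portion_amount(389, 40)   # → 155.6
```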

Wrapping it as a service

A tiny Flask service in front of this gives you a USDA-equivalent API on your LAN:

from flask import Flask, jsonify
import psycopg

app = Flask(__name__)
DSN = "postgresql://usda:secret@localhost/usda"

@app.route("/v1/barcode/<gtin>")
def barcode(gtin):
    # One connection per request keeps the sketch simple; use a pool in production.
    with psycopg.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT f.fdc_id, f.description, bf.serving_size,
                       (SELECT amount FROM food_nutrient fn
                        JOIN nutrient n ON fn.nutrient_id = n.id
                        WHERE fn.fdc_id = f.fdc_id AND n.name = 'Energy'
                          AND n.unit_name = 'kcal')
                FROM food f JOIN branded_food bf ON f.fdc_id = bf.fdc_id
                WHERE bf.gtin_upc = %s
            """, (gtin,))
            row = cur.fetchone()
    if not row:
        return ("", 404)
    return jsonify({"fdc_id": row[0], "description": row[1],
                    "serving_size": row[2], "kcal_100g": row[3]})

Behind a Caddy reverse proxy, this gives the rest of your homelab a stable internal endpoint. Sub-10ms responses. No rate limits.

Refresh strategy

Quarterly is enough for most uses. The Branded sub-database changes the most (manufacturers add SKUs continuously), but for a personal tracker the catalog drifts more slowly than your own eating habits do.

We script refresh with:

  1. Download the new dump.
  2. Load into a _staging schema.
  3. Atomic schema swap (BEGIN; ALTER ...; COMMIT;).
  4. Drop the old.

Total downtime: under 30 seconds. Most of the work happens in the staging schema.
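The swap in step 3 can be two schema renames inside one transaction, so readers never see a half-loaded database. A sketch that just assembles the statements (the schema names fdc, fdc_staging, and fdc_old are assumptions, not names from the dump):

```python
def swap_statements(live: str = "fdc", staging: str = "fdc_staging",
                    retired: str = "fdc_old") -> list[str]:
    """SQL to atomically promote the staging schema over the live one."""
    return [
        "BEGIN;",
        f"ALTER SCHEMA {live} RENAME TO {retired};",
        f"ALTER SCHEMA {staging} RENAME TO {live};",
        "COMMIT;",
        # Step 4, run once nothing still references the old data:
        f"DROP SCHEMA {retired} CASCADE;",
    ]
```

Because both renames commit together, every query sees either the old dataset or the new one, never a mix; that is what keeps the downtime to the rename itself.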

When this isn’t worth it

If you’re a single user with your own API key, the API is fine for a personal tracker that does maybe 200 lookups a day. Bulk loading is worth it when:

  • Multiple devices share one server.
  • You want offline-first.
  • You’re building a tool that does heavier search.
  • You want to do bulk analysis (notebook-style queries against the whole DB).
