H and H Coffee Factory — Knowledge Base Schema
This document overrides all LLM Wiki skill defaults. Read it first.
Architecture — knowledge-base/ is the single source of truth
This repository is a one-person experimental research project. The wiki is the single canonical source of truth for all research-grade data: people, companies, brands, places, events, documents, accession records, artifact catalog items, and gallery definitions. The Jekyll site renders from a projection of the wiki into _data/ (built by scripts/accessions_build_jekyll_data.exs and related generators).
┌──────────────────────────────────────┐
│  knowledge-base/ (SOURCE OF TRUTH)   │
│                                      │
│  people/    brands/    companies/    │
│  places/    events/    documents/    │
│  accessions/    artifacts/           │
│  galleries/                          │
│                                      │
│  raw-sources/    raw-archives/       │
└──────────────────┬───────────────────┘
                   │ (Elixir generators)
                   ▼
┌──────────────────────────────────────┐
│  _data/  _brands/ (PROJECTION)       │
│                                      │
│  _data/accessions.yml                │
│  _data/galleries/*/items/*.yml       │
│  _data/galleries/*/order.yml         │
│  _brands/*.md                        │
└──────────────────┬───────────────────┘
                   │ (Jekyll build)
                   ▼
        _site/ (rendered output)
Implication: do not edit _data/accessions.yml, _data/galleries/*/items/, or _brands/ by hand. Edit the canonical record under knowledge-base/ and re-run the generator.
Directory
knowledge-base/
Topics — narrative research
Compiled pages live in topic directories:
| Directory | Contents |
|---|---|
| people/ | Biographical pages: founders, family members, employees |
| brands/ | Brand histories: product lines, trademarks, packaging. Canonical source for the _brands/*.md Jekyll collection |
| companies/ | Organizational pages: coffee companies (Hoffmann-Hayman, Western Coffee Co., etc.) |
| events/ | Historical milestones, openings, closings, transitions. Files with timeline: true in frontmatter are projected to _data/events.yml for the Jekyll /history/ page. Rich research synthesis pages (without timeline:) coexist as KB-only. |
| documents/ | Synthesis pages for specific primary documents |
| places/ | Locations: factories, offices, distribution points |
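The timeline: true opt-in for events/ can be illustrated with a minimal sketch. This is not the project's Elixir generator; the function names are hypothetical, and it assumes each page opens with a `---` frontmatter block and writes the flag as a literal `timeline: true` line.

```python
import re

def split_frontmatter(text):
    """Return (frontmatter_lines, body) for a page that opens with a '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], text  # unterminated block: treat the page as having no frontmatter

def is_timeline_event(text):
    """True only when the frontmatter carries a literal `timeline: true` line."""
    frontmatter, _body = split_frontmatter(text)
    return any(re.fullmatch(r"timeline:\s*true\s*", line) for line in frontmatter)
```

A page that mentions `timeline: true` only in its body, or omits the key, stays KB-only under this rule.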
Topics — structured collection data
These directories carry one file per record. Frontmatter is the structured data; body is research notes / interpretation.
| Directory | Contents | Projects to |
|---|---|---|
| accessions/ | One file per accession record (e.g. HH-AD-2014-0001.md). Schema in docs/history/2026-04-30-h-and-h-accession-and-loan-readiness-design.md. Frontmatter mirrors the _data/accession_records/ v2 schema (accession_id, object_title, category, acquisition_source, acquired_date, acquisition_reference, possession_status, location, condition, loan_ready, notes). | _data/accessions.yml (Jekyll consumes) |
| artifacts/ | One file per catalog item, keyed by clip_id (e.g. HH-REF-2023-0001.md, HH-FACT-0000-0001.md). Frontmatter mirrors the catalog item schema (clip_id, title, alt, image_basename, image_path, url, gallery). Body is for research notes specific to the artifact. | _data/galleries/<gallery>/items/<clip_id>.yml |
| galleries/ | One file per gallery (e.g. reference.md, factory.md). Sequence/ordering + gallery-level metadata. | _data/galleries/<gallery>/order.yml |
Source Buckets
Raw sources live in raw-sources/ organized by type. Note: raw-sources/ is for primary historical sources only (clippings, ads, USPTO filings, etc.). Artifact catalog items and accession records have their own top-level directories (artifacts/, accessions/) — see “Topics — structured collection data” above.
| Bucket | Contents |
|---|---|
| newspapers/ | Clippings, death notices, news articles |
| advertisements/ | Ad copy, promotional records, trade materials (including audio/video transcription discs) |
| images/ | Primary-source scans (USPTO filings, postcards, letterheads, lithographic prints) where the image itself is the document |
| research/ | Secondary sources, research notes, existing docs/ content |
Primary source vs. artifact documentation
The distinction matters: a primary source is the historical document itself (a 1923 newspaper clipping, a 1922 USPTO filing). An artifact is a physical object that survives from the period; an artifact documentation photograph is a 2014–2026 photo of that object. The photograph is contemporary documentation, not a period source. Examples:
- 1923 SA Light clipping (scan) → newspapers/ bucket, primary source
- 1922 USPTO trademark Official Gazette clipping (image of the filing) → images/ bucket, primary source
- A 1920s H and H Blend tin → an artifact
- A 2014 photo of that 1920s tin → artifacts/ directory, documents an artifact
- A 2026 photo of the 601 Delaware factory exterior → artifacts/ directory, field documentation
Artifacts — catalog under artifacts/
The artifact catalog lives at knowledge-base/artifacts/<clip_id>.md, one file per item, organized by gallery (gallery is a frontmatter field, not a subdirectory — flat layout for easier cross-cutting search).
Gallery taxonomy (mirrors the legacy _data/galleries/ structure for now):
| Gallery | Contents | Approx. count |
|---|---|---|
| branding_newspaper | Display-ad scans | 51 |
| collection | Items the museum owns | 185 |
| factory | 601 Delaware site documentation | 103 |
| newspaper | Newspaper-clip scans | 299 |
| not_our_h_and_h | Look-alikes / non-H&H reference | 17 |
| reference | H&H items documented but not owned | 80 |
| wanted | Items being sought | 9 |
Each artifact file has frontmatter mirroring the catalog item schema, plus optional cross-links to other wiki pages. The body is for research notes specific to that artifact.
---
clip_id: HH-REF-2023-0001
type: artifact
gallery: reference
title: "Large bulk-size H and H Blend Coffee tin (lid missing) — 'H & H' monogram side panel and 'We roast It / others praise It' slogan front-face cartouche, early-1920s Hoffmann-Hayman branding"
alt: "Color photograph documented 2023-05-20 of a heavily worn large square-cross-section bulk-size H and H Blend Coffee tin…"
image_basename: 2023-05-20-h-and-h-blend-large-bulk-tin-monogram-side
image_path: /assets/images/thumbnail/2023-05-20-h-and-h-blend-large-bulk-tin-monogram-side.jpg
url: /assets/images/gallery/2023-05-20-h-and-h-blend-large-bulk-tin-monogram-side.jpg
date_documented: 2023-05-20
brands: [h-and-h-blend]
period_referenced: "early-1920s"
---
# Large bulk-size H and H Blend Coffee tin (early-1920s)
Research notes about variant, attribution, comparable items, etc.
The Elixir generator projects knowledge-base/artifacts/*.md → _data/galleries/<gallery>/items/<clip_id>.yml (Jekyll consumes the projection).
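To make the projection rule concrete, here is a minimal Python sketch of how one artifact's frontmatter could map to its output path. It is not the Elixir generator itself. The function name is hypothetical, the frontmatter is assumed to be already parsed into a dict, and dropping non-catalog fields (such as brands) from the projected item is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath

# Catalog fields named in this schema. Anything else in the frontmatter
# (brands, period_referenced, ...) is ASSUMED to stay wiki-side.
CATALOG_FIELDS = ("clip_id", "title", "alt", "image_basename",
                  "image_path", "url", "gallery")

def project_artifact(frontmatter):
    """Return (output_path, item) for one artifact's parsed frontmatter."""
    missing = [k for k in CATALOG_FIELDS if k not in frontmatter]
    if missing:
        raise ValueError(f"missing catalog fields: {missing}")
    item = {k: frontmatter[k] for k in CATALOG_FIELDS}
    # gallery is a frontmatter field, so it alone decides the output directory.
    path = (PurePosixPath("_data/galleries") / item["gallery"]
            / "items" / f"{item['clip_id']}.yml")
    return path, item
```

Because the source layout is flat, moving an item between galleries under this sketch only requires changing its gallery field; the file under artifacts/ keeps its name.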
Cross-references from narrative topic pages
When a topic page (people, brands, companies, events, places) cites an artifact, reference the clip_id in frontmatter:
artifacts:
- HH-FACT-0000-0001 # 1932 G.W. Mitchell construction photo of the Hayman factory exterior
- HH-COLL-0000-0042 # 1lb H and H Blend tin (front face)
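A cross-reference list like the one above is easy to check mechanically. The sketch below is a hypothetical helper, not part of the documented generators. The clip_id pattern is an assumption inferred from the examples in this document (HH, a category code, a four-digit year, a four-digit sequence).

```python
import re

# ASSUMED shape, inferred from examples such as HH-REF-2023-0001.
CLIP_ID = re.compile(r"HH-[A-Z]+-\d{4}-\d{4}")

def check_artifact_refs(refs, known_ids):
    """Split a page's artifacts: list into malformed IDs and dangling IDs."""
    malformed = [r for r in refs if not CLIP_ID.fullmatch(r)]
    missing = [r for r in refs if CLIP_ID.fullmatch(r) and r not in known_ids]
    return malformed, missing
```

Here known_ids would be the set of clip_ids found under knowledge-base/artifacts/; a well-formed ID that is absent from that set points at a catalog record that does not exist yet.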
When to add a raw-sources/ entry for an artifact
Most artifacts do not belong in raw-sources/. That registry is for primary historical sources. Add a raw-sources/ row only when the artifact itself functions as a primary source (e.g., a maker’s mark or embossment that’s the only surviving documentation of a fact). Otherwise: the artifact file under artifacts/<clip_id>.md is its own canonical record.
Accessions
Provenance records — one file per acquisition. Schema documented in docs/history/2026-04-30-h-and-h-accession-and-loan-readiness-design.md.
---
accession_id: "HH-AD-2014-0001"
type: accession
object_title: "1960 vintage print advertisement — The Toy House World, Saint Paul, Minnesota (context ephemera)"
category: "AD"
nomenclature_term: "advertisement"
acquisition_source: "ebay" # ebay | vendor | donation | other
acquired_date: "2014-07-07"
acquisition_reference: "261521640532"
possession_status: "in_collection" # in_collection | on_loan_out | on_loan_in | transferred | deaccessioned | missing | unknown
location: "Private collection, San Antonio, Texas"
condition: "unknown" # excellent | good | fair | poor | unknown
loan_ready: false
artifacts: # optional — clip_ids of the artifact catalog items this accession produced
- HH-REF-2014-0001
notes: >
  Reference/context item; draft needs object scan before loan packaging.
---
# 1960 vintage print advertisement — Toy House World
(optional research notes about the acquisition, condition history, etc.)
Raw evidence files (receipt PDFs, eBay screenshots, antique-mall receipts) live at records/ (top-level, unchanged from the existing pattern). Each accession references its evidence via acquisition_reference (eBay transaction_id, antique-mall receipt filename, etc.).
The Elixir generator projects knowledge-base/accessions/*.md → _data/accessions.yml for Jekyll, also writing museum-ready CSV exports.
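The enumerated values in the example's comments lend themselves to a simple validation pass. The following Python sketch is illustrative only and does not reproduce accessions_validate_and_export.exs. The function name is hypothetical, and treating every listed field as required is an assumption.

```python
# Allowed values copied from the comments in the accession example above.
ALLOWED = {
    "acquisition_source": {"ebay", "vendor", "donation", "other"},
    "possession_status": {"in_collection", "on_loan_out", "on_loan_in",
                          "transferred", "deaccessioned", "missing", "unknown"},
    "condition": {"excellent", "good", "fair", "poor", "unknown"},
}

def validate_accession(record):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    # loan_ready is a YAML boolean in the example, so a string is rejected.
    if not isinstance(record.get("loan_ready"), bool):
        errors.append("loan_ready: expected true or false")
    return errors
```

Collecting every problem instead of stopping at the first makes one validation run enough to clean up a record.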
The purchase: frontmatter in _posts/*.md is not removed; posts continue to carry display-time provenance copy. The accession record in knowledge-base/accessions/ is canonical. Posts cite it by accession_id (a frontmatter field), so the generator can validate that the two stay consistent.
Galleries
Gallery-level configuration (sequence/order, gallery title, description):
---
gallery: reference
title: Reference
description: "Photographs of H and H Coffee items found online — not in our collection"
order_strategy: manual
sequence:
- HH-REF-2023-0001
- HH-REF-2023-0002
- HH-REF-0000-0001
# …
---
The Elixir generator projects knowledge-base/galleries/<gallery>.md → _data/galleries/<gallery>/order.yml.
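With order_strategy: manual, the sequence list can drift from the artifact catalog. The sketch below shows one way such drift could be detected. It is a hypothetical check, not documented generator behavior, and it assumes a mapping from clip_id to gallery has already been read from the artifact frontmatter.

```python
def check_sequence(gallery, sequence, artifact_galleries):
    """Compare a gallery's manual sequence with the artifacts assigned to it.

    artifact_galleries maps clip_id -> gallery name (from artifact frontmatter).
    """
    in_gallery = {cid for cid, g in artifact_galleries.items() if g == gallery}
    seen, duplicates = set(), []
    for cid in sequence:
        if cid in seen:
            duplicates.append(cid)
        seen.add(cid)
    return {
        # listed in the sequence but not an artifact of this gallery
        "unknown": [cid for cid in sequence if cid not in in_gallery],
        # listed more than once
        "duplicates": duplicates,
        # assigned to this gallery but absent from the sequence
        "unlisted": sorted(in_gallery - seen),
    }
```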
Reliability
- Primary historical documents: reliability: high
- Secondary sources and research notes: reliability: mixed unless verified against primary sources
Frontmatter Conventions
Mandatory: title, type, updated, sources
Optional:
- tags: era (1890s, 1900s … 1960s), document type (founder, brand, advertisement), subject (hoffmann, hayman, western-coffee)
- period: date range for historical entries (e.g. 1899–1920)
- reliability: high | mixed | unverified
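These conventions can be linted with a few lines of Python. This is a sketch under assumptions: the helper name is hypothetical, the frontmatter is assumed to be already parsed into a dict, and no such lint step is documented for this repository.

```python
MANDATORY = ("title", "type", "updated", "sources")
RELIABILITY = {"high", "mixed", "unverified"}

def lint_frontmatter(frontmatter):
    """Return a list of problems with a page's parsed frontmatter dict."""
    problems = [f"missing mandatory key: {key}"
                for key in MANDATORY if key not in frontmatter]
    # reliability is optional, but when present it must use a known value.
    value = frontmatter.get("reliability")
    if value is not None and value not in RELIABILITY:
        problems.append(f"reliability: unexpected value {value!r}")
    return problems
```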
PDF Handling
Attempt text extraction with pdftotext -layout <file> <output.txt> first. If the PDF is image-based (extraction yields only header metadata), read it visually.
Register the PDF path in raw-sources/index.md.
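The extract-then-fall-back step could be scripted roughly as follows. This is a sketch, not an established project script. It assumes pdftotext is on PATH, and the 200-character threshold for deciding that a PDF is image-based is an arbitrary illustrative value, not a documented rule.

```python
import subprocess
from pathlib import Path

def extract_text(pdf_path, out_path):
    """Run `pdftotext -layout <file> <output.txt>` and return the extracted text."""
    subprocess.run(["pdftotext", "-layout", str(pdf_path), str(out_path)],
                   check=True)
    return Path(out_path).read_text(encoding="utf-8", errors="replace")

def looks_image_based(text, min_chars=200):
    """Heuristic: almost no non-whitespace text suggests a scanned PDF.

    min_chars is an ASSUMED threshold; tune it against real files.
    """
    return len("".join(text.split())) < min_chars
```

When looks_image_based returns True for the output of extract_text, the PDF is a candidate for visual reading instead.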
Inbox Processing
After a source from work/inbox/ is ingested, move it to knowledge-base/raw-archives/<bucket>/ and rename using the item’s publication or creation date:
- Newspapers: YYYY-MM-DD_slug.pdf (date = publication date)
- Advertisements: YYYY-MM-DD_slug.{pdf,mp3,mp4,m4a,…} (date = publication or session date; audio/video formats allowed)
- Images (primary-source scans): YYYY-MM-DD_slug.<ext> (date = creation/capture date if known, otherwise acquisition date)
- Artifacts (object/field photographs): do not move to raw-archives/. Hand off to _data/galleries/<gallery>/ so the Jekyll catalog assigns a clip_id. The binary lives in assets/images/gallery/ (canonical) and assets/images/thumbnail/ (thumb). The wiki then references the clip_id.
Update the path or clip_id in raw-sources/index.md after moving / cataloging.
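The archive naming rule can be expressed as a small helper. The sketch below is hypothetical: the function names and the slug normalization (lowercase, hyphen-separated) are assumptions, and only the three buckets with a naming rule above are accepted.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Buckets that have a YYYY-MM-DD_slug naming rule in the list above.
ARCHIVE_BUCKETS = {"newspapers", "advertisements", "images"}

def slugify(title):
    """ASSUMED slug style: lowercase, runs of other characters become hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def archive_path(bucket, when, title, ext):
    """Build knowledge-base/raw-archives/<bucket>/YYYY-MM-DD_slug.<ext>."""
    if bucket not in ARCHIVE_BUCKETS:
        # Artifact photographs are cataloged instead of archived.
        raise ValueError(f"{bucket!r} is not moved to raw-archives/")
    name = f"{when.isoformat()}_{slugify(title)}.{ext.lstrip('.')}"
    return PurePosixPath("knowledge-base/raw-archives") / bucket / name
```

For example, `archive_path("newspapers", date(1923, 1, 2), "Example Clipping", "pdf")` yields `knowledge-base/raw-archives/newspapers/1923-01-02_example-clipping.pdf`; the date and title here are placeholders, not real records.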
Audio / video advertising material
Belongs in advertisements/. The MP4/MP3/M4A binary lives in knowledge-base/raw-archives/advertisements/ alongside a *.transcript.md companion file. Frontmatter slug uses the recording session date (YYYY-MM-DD_…) even if multiple takes share that date. See 1961-08-01_hh-master-chef-radio-broggi-track-1.* for the established pattern.
Generators
All projections from knowledge-base/ to _data/ and _brands/ are written by Elixir scripts under scripts/, each wired into Jekyll’s :after_reset hook via a sibling _plugins/regenerate_*.rb:
| Generator | Reads | Writes | Plugin |
|---|---|---|---|
| accessions_validate_and_export.exs | knowledge-base/accessions/*.md | _data/accessions.yml, museum CSVs | _plugins/regenerate_accessions_data.rb |
| accessions_build_jekyll_data.exs | (delegates to above) | _data/accessions.yml | (same) |
| artifacts_build_jekyll_data.exs | knowledge-base/artifacts/*.md | _data/galleries/<gallery>/items/<clip_id>.yml | _plugins/regenerate_artifacts_data.rb |
| galleries_build_jekyll_data.exs | knowledge-base/galleries/*.md | _data/galleries/<gallery>/order.yml | _plugins/regenerate_galleries_data.rb |
| brands_build_jekyll_collection.exs | knowledge-base/brands/*.md (only files with jekyll_filename:) | _brands/*.md (Jekyll collection stubs) | _plugins/regenerate_brands_collection.rb |
| events_build_jekyll_data.exs | knowledge-base/events/*.md (only files with timeline: true) | _data/events.yml | _plugins/regenerate_events_data.rb |
Each plugin honors its <name>_data.regenerate_on_build / .regenerate_only_when_stale config in _config.yml and the matching SKIP_<NAME>_REGEN / FORCE_<NAME>_REGEN env overrides.
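How the config keys and env overrides might combine is sketched below in Python for illustration; the real logic lives in the Ruby plugins and is not reproduced here. Everything about precedence is an assumption: SKIP beats FORCE, FORCE beats config, any non-empty env value counts as set, and the defaults are regenerate_on_build: true and regenerate_only_when_stale: false.

```python
def should_regenerate(name, config, env, is_stale):
    """Decide whether the <name> generator should run during a build.

    name     -- e.g. "accessions" or "events"
    config   -- parsed _config.yml as a dict
    env      -- environment mapping (e.g. dict(os.environ))
    is_stale -- whether the projection is older than its sources
    """
    prefix = name.upper()
    if env.get(f"SKIP_{prefix}_REGEN"):    # ASSUMED: skip wins over everything
        return False
    if env.get(f"FORCE_{prefix}_REGEN"):   # ASSUMED: force overrides config
        return True
    section = config.get(f"{name}_data", {})
    if not section.get("regenerate_on_build", True):
        return False
    if section.get("regenerate_only_when_stale", False):
        return is_stale
    return True
```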
Files that stay in _data/ (not migrated)
These remain in _data/ because they are Jekyll-display-only or auto-generated from raw eBay reports:
| File | Why it stays |
|---|---|
| navigation.yml | Site nav config: pure Jekyll display |
| ui-text.yml | UI strings: pure Jekyll display |
| story_taxonomy.yml | Controlled vocab for _pages/artifact-index.md filters; content-coupled to a Jekyll template |
| acquisitions.yml, ebay_purchase_history.yml | Auto-generated by scripts/combine_ebay_purchase_history.exs from _data/2014-2023-ebayReports/ and _data/2023-2026-ebayReports/ raw archives |
| _data/galleries/<gallery>/items/, _data/galleries/<gallery>/order.yml | Projection output from knowledge-base/artifacts/ and knowledge-base/galleries/ (Steps 3–4) |
| _data/accessions.yml | Projection output from knowledge-base/accessions/ (Step 2) |
| _data/events.yml | Projection output from knowledge-base/events/ (Step 6) |
| _data/2014-2023-ebayReports/, _data/2023-2026-ebayReports/ | Raw eBay export archives (evidence) |
Framework Note
knowledge-base/ is the canonical store; Jekyll consumes a projection. The wiki can be migrated to a different static-site framework without losing the underlying research data — only the generators need to change.
Changelog
- v3.1 (2026-05-15): Step 6 audit complete. _data/events.yml joins the projection family (generated from knowledge-base/events/*.md with timeline: true opt-in). Vestigial _data/items.yaml and _data/crystalvac_jars.yaml deleted (no Jekyll consumers). navigation.yml, ui-text.yml, story_taxonomy.yml, acquisitions.yml, and ebay_purchase_history.yml documented as stays-in-data. Generator + plugin table moved to single canonical location.
- v3 (2026-05-15): knowledge-base/ becomes the single source of truth for all research-grade data. Added accessions/, artifacts/, galleries/ as top-level topic directories (canonical records, one file per record). _data/accession_records/, _data/galleries/*/items/, _data/galleries/*/order.yml, and _brands/*.md become Jekyll-side projections generated from knowledge-base/. Removed the v2 raw-sources/artifacts/ bucket (artifacts are now top-level, not under raw-sources). Documented the generator chain.
- v2 (2026-05-15): Added artifacts/ bucket under raw-sources/; distinguished primary-source scans from artifact documentation; documented _data/galleries/ as authoritative for imaged items. Superseded by v3.
- v1 (2026-05-15): Initial schema with newspapers/, advertisements/, images/, research/ buckets.