The Enrichment Pipeline
What happens after upload beyond raw indexing, and how enrichment improves the user-facing source surfaces.
Ingestion gets a document into the system. Enrichment makes that document easier to understand, rank, and browse.
This distinction matters. A source can be present in MARCUS before a human user can readily interpret it. Enrichment bridges that gap.
Ingestion Versus Enrichment
It helps to separate two ideas:
- Ingestion makes the source searchable.
- Enrichment makes the source understandable and easier to manage.
Without ingestion, the document cannot participate in retrieval. Without enrichment, the document may still be retrieved, but users have less help interpreting what it is and how much weight it should carry.
Upload To Ready-State Flow
At a high level, the source pipeline looks like this:
- Register the uploaded asset as a source.
- Extract text and metadata from the file.
- Chunk the content.
- Generate embeddings and persist chunk rows.
- Verify indexed chunk integrity.
- Run enrichment passes such as summary, authority scoring, and concept extraction when configured.
The exact timing can vary depending on environment and queue configuration, but this is the practical flow users experience.
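The flow above can be sketched as an ordered list of stages. This is a minimal illustration, not MARCUS's implementation: the stage names, the `Source` class, and `run_pipeline` are all hypothetical, and the rule that a source counts as searchable once embeddings are persisted is an assumption drawn from the stage descriptions in this section.

```python
from dataclasses import dataclass, field

# Hypothetical stage names mirroring the list above; not MARCUS identifiers.
STAGES = [
    "register",
    "extract",
    "chunk",
    "embed_and_persist",
    "verify_chunks",
    "enrich",
]


@dataclass
class Source:
    filename: str
    completed: list = field(default_factory=list)

    @property
    def searchable(self) -> bool:
        # Assumption: retrieval becomes possible once embeddings are persisted.
        return "embed_and_persist" in self.completed

    @property
    def enriched(self) -> bool:
        return "enrich" in self.completed


def run_pipeline(source: Source, enrichment_enabled: bool = True) -> Source:
    """Advance a source through each stage in order."""
    for stage in STAGES:
        if stage == "enrich" and not enrichment_enabled:
            continue  # enrichment passes run only when configured
        source.completed.append(stage)
    return source
```

With `enrichment_enabled=False`, the resulting source is searchable but not enriched, which is the state this page distinguishes from a fully ready source.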
Stage 1: Source Registration
As soon as a file is accepted, MARCUS creates a source record and stores basic information about the upload.
At this stage, the document may show up in the project list even though it is not yet retrievable. This is normal, but it often confuses first-time users.
Stage 2: Text And Metadata Extraction
MARCUS then tries to pull usable text and metadata from the file.
This is where document quality matters:
- a clean text PDF often works well
- a scanned image or malformed export may extract poorly
- missing metadata does not always break retrieval, but it does reduce interpretability
If later briefing content looks strange, this extraction stage is often where the problem began.
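One way to catch poor extraction early is a rough sanity check on the extracted text. The sketch below is an assumption for illustration only: the function name and both thresholds are hypothetical, and nothing here describes how MARCUS evaluates extraction quality.

```python
def extraction_looks_usable(
    text: str,
    min_chars: int = 200,          # hypothetical threshold
    min_readable_ratio: float = 0.6,  # hypothetical threshold
) -> bool:
    """Rough heuristic: enough text, and mostly letters or whitespace."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # empty or near-empty output, e.g. a scanned image
    readable = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return readable / len(stripped) >= min_readable_ratio
```

A clean text PDF would normally pass a check like this, while a scanned image that yields almost no text, or a malformed export full of symbols, would not.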
Stage 3: Chunking
The extracted text is broken into smaller pieces called chunks.
This is necessary because most questions are answered from part of a document, not from the entire file at once. Chunking lets retrieval find the relevant section rather than simply pointing to a whole PDF and hoping the user can find the right paragraph.
Good chunking improves:
- retrieval precision
- citation usefulness
- answer specificity
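To make the idea concrete, here is a minimal sketch of one common chunking approach: fixed-size word windows with some overlap between neighbours. The document does not say which strategy MARCUS uses, so the function name, the window size, and the overlap are assumptions.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of at most `size` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, which tends to help citation usefulness.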
Stage 4: Embedding And Indexing
Each chunk is converted into an embedding, a numeric representation suited to retrieval, and stored with the chunk. Once this stage succeeds, the source can usually participate in search.
This is the practical meaning of a source becoming indexed or ready.
At this point:
- the source may already be retrievable in chat
- some enrichment fields may still be missing
That difference explains why a source can answer questions before every field in its briefing panel is visible.
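The embedding and indexing step in this stage can be illustrated with a toy example. The sketch below uses word counts as a stand-in for a real embedding model and an in-memory list as a stand-in for stored chunk rows; both are assumptions chosen to keep the example self-contained, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Pair every chunk with its vector, like persisting chunk rows."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(index: list[tuple[str, Counter]], query: str, top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Note that `search` needs only chunks and vectors. No summary, tag, or authority field is involved, which is why retrieval can work while the briefing is still incomplete.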
Stage 5: Enrichment Passes
After indexing, MARCUS can run additional analysis to create source-level support material.
Enrichment can populate:
- summary text
- key points
- tags and extracted concepts
- document-type inference
- authority explanation
These are the fields that make the briefing and Library surfaces usable rather than just searchable.
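The fields listed above can be modelled as a record in which any field may still be empty. This is a sketch under stated assumptions: the `Briefing` class, its field names, and `pending_fields` are hypothetical and do not reflect MARCUS's actual schema.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Briefing:
    # Hypothetical field names based on the list above.
    summary: Optional[str] = None
    key_points: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    document_type: Optional[str] = None
    authority_explanation: Optional[str] = None


def pending_fields(briefing: Briefing) -> list[str]:
    """Names of enrichment fields that are still empty."""
    return [f.name for f in fields(briefing) if not getattr(briefing, f.name)]
```

A briefing with only a summary and tags would report `key_points`, `document_type`, and `authority_explanation` as pending, which is a partially enriched source rather than a broken one.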
Why Enrichment Matters So Much For Humans
Two projects can both be searchable, but the one with richer enrichment is easier to audit and maintain because users can inspect source quality faster.
Enrichment helps answer questions like:
- Did this document upload correctly?
- Is this the kind of source I expected?
- Does the summary match the document?
- Does the authority level make sense?
- Is this source likely to help answer the questions I care about?
Without enrichment, users can still search, but they have fewer shortcuts for evaluating corpus quality.
Why Enrichment May Lag Behind Searchability
Indexing and enrichment do not always finish at the same time. A source may become retrievable before every enrichment field is visible in the briefing.
This is normal because:
- retrieval depends on chunk and embedding availability
- enrichment depends on additional analysis passes
- those later passes may take extra time or run asynchronously
So "I can ask about it" and "I can fully inspect it" may happen in that order rather than simultaneously.
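The ordering described here can be sketched with Python's `asyncio`: indexing is awaited, while enrichment is scheduled as a background task. The function names, the `state` dictionary, and the sleep durations are assumptions for illustration; the document does not specify how MARCUS schedules these passes.

```python
import asyncio


async def index_source(state: dict) -> None:
    await asyncio.sleep(0.01)  # stands in for chunking and embedding
    state["retrievable"] = True


async def enrich_source(state: dict) -> None:
    await asyncio.sleep(0.05)  # stands in for summary, tagging, authority passes
    state["enriched"] = True


async def ingest(state: dict) -> asyncio.Task:
    await index_source(state)
    # Enrichment is scheduled in the background rather than awaited here.
    return asyncio.create_task(enrich_source(state))


async def demo() -> list:
    state = {"retrievable": False, "enriched": False}
    task = await ingest(state)
    after_indexing = dict(state)  # snapshot: searchable, not yet enriched
    await task
    return [after_indexing, dict(state)]
```

Running `asyncio.run(demo())` yields a first snapshot in which the source is retrievable but not enriched, and a second in which both are true.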
Common User Misunderstandings
| Misunderstanding | Better interpretation |
|---|---|
| "The source is in the list, so it must already be fully usable." | Visible in the list and fully indexed are not always the same thing. |
| "The briefing is incomplete, so the source is broken." | The source may be retrieved normally while some enrichment fields are still pending. |
| "The summary looks odd, so the model is bad." | Odd briefings often begin with poor file quality or extraction problems. |
| "If chat can use the source, I do not need the briefing." | Briefings still help you judge whether the source is trustworthy and well-classified. |
Operational Implications
If you want a healthy corpus, enrichment should be part of your review habit:
- upload the source
- wait for indexing
- open the briefing
- check whether the source looks right
- ask a narrow test question
This workflow is far more reliable than uploading many files and assuming that everything is fine because no error appeared.
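The review habit can also be written down as a small checklist. The sketch below is hypothetical: the function name and its three inputs are assumptions, and each input stands for a judgment the reviewer makes by hand rather than a value MARCUS reports.

```python
def review_source(
    indexed: bool,
    briefing_looks_right: bool,
    test_question_surfaced_source: bool,
) -> list[str]:
    """Return the follow-ups a reviewer should act on; empty means healthy."""
    if not indexed:
        # Nothing else can be judged until indexing has finished.
        return ["not indexed yet: wait before judging the source"]
    problems = []
    if not briefing_looks_right:
        problems.append("briefing looks wrong: check file quality and extraction")
    if not test_question_surfaced_source:
        problems.append("narrow test question did not surface this source")
    return problems
```

An empty list means the source passed every check in the workflow, not merely that no error was displayed.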
What Enrichment Cannot Do
Enrichment improves interpretability, but it does not automatically fix:
- bad project boundaries
- missing key sources
- contradictory versions
- poor local governance of the corpus
It is a quality amplifier, not a substitute for source curation.
One Useful Mental Model
Think of ingestion as "getting the document into the room" and enrichment as "putting a readable label on it, summarizing it, and telling you how much weight it likely deserves." Both matter, but they solve different problems.