Morphik lets you filter documents and chunks directly in the database using a concise JSON filter syntax. The same structure powers the REST API, Python SDK (sync + async), folder helpers, UserScope, and knowledge-graph builders, so you can define a filter once and reuse it everywhere.
Prefer server-side filters over client-side post-processing. You’ll reduce bandwidth, improve performance, and keep behavior consistent between endpoints.
Where Filters Apply
You can pass filters (or document_filters) to:
Quick Start
from datetime import datetime
from morphik import Morphik
db = Morphik()
filters = {
    "$and": [
        {"department": {"$eq": "research"}},
        {"priority": {"$gte": 40}},
        {"start_date": {"$lte": datetime.now().isoformat()}},
        {"tags": {"$contains": {"value": "contract"}}}
    ]
}
chunks = db.retrieve_chunks("project delta highlights", filters=filters, k=6)
Typed comparisons (numbers, decimals, dates, datetimes) rely on metadata_types. Supply the per-field hints during ingest or metadata updates:
doc = db.ingest_text(
    content="SOW for Delta",
    metadata={
        "priority": 42,
        "start_date": "2024-01-15T12:30:00Z",
        "end_date": "2024-12-31",
        "cost": "1234.56"
    },
    metadata_types={
        "priority": "number",
        "start_date": "datetime",
        "end_date": "date",
        "cost": "decimal"
    }
)
If you omit a hint, Morphik infers one automatically for simple scalars, but explicitly declaring types is recommended for reliable range queries.
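To make explicit hints less tedious, they can be derived from the Python values before ingest. The sketch below is a hypothetical client-side helper (`infer_metadata_types` is not part of the Morphik SDK), and it does not claim to mirror the server's own inference. The type names are the ones listed under `$type` in this guide.

```python
from datetime import date, datetime
from decimal import Decimal


def infer_metadata_types(metadata):
    """Hypothetical helper: map Python values to metadata type-hint names."""
    hints = {}
    for key, value in metadata.items():
        # Order matters: bool is a subclass of int, and datetime is a
        # subclass of date, so the narrower type is checked first.
        if value is None:
            hints[key] = "null"
        elif isinstance(value, bool):
            hints[key] = "boolean"
        elif isinstance(value, (int, float)):
            hints[key] = "number"
        elif isinstance(value, Decimal):
            hints[key] = "decimal"
        elif isinstance(value, datetime):
            hints[key] = "datetime"
        elif isinstance(value, date):
            hints[key] = "date"
        elif isinstance(value, (list, tuple)):
            hints[key] = "array"
        elif isinstance(value, dict):
            hints[key] = "object"
        else:
            hints[key] = "string"
    return hints


metadata = {
    "priority": 42,
    "start_date": datetime(2024, 1, 15, 12, 30),
    "end_date": date(2024, 12, 31),
    "cost": Decimal("1234.56"),
}
metadata_types = infer_metadata_types(metadata)
```

The resulting dict has the same shape as the `metadata_types` argument shown above, so it could be reviewed and passed along with the metadata.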
DateTime and Timezone Behavior
Morphik preserves your timezone format exactly as provided:
| Input | Stored As | Notes |
|---|---|---|
| datetime(2024, 1, 15) (naive) | "2024-01-15T00:00:00" | No timezone added |
| datetime(2024, 1, 15, tzinfo=UTC) | "2024-01-15T00:00:00+00:00" | Timezone preserved |
| "2024-01-15T12:00:00Z" (string) | "2024-01-15T12:00:00+00:00" | Z converted to +00:00 |
| 1705312800 (UNIX timestamp) | "2024-01-15T10:00:00+00:00" | Timestamps are inherently UTC |
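The stored forms in the table happen to match what Python's standard library produces with `isoformat()`, which makes them easy to predict locally. The snippet below is only a standard-library illustration of the four rows; it makes no claim about how Morphik serializes values internally.

```python
from datetime import datetime, timezone

naive = datetime(2024, 1, 15)
aware = datetime(2024, 1, 15, tzinfo=timezone.utc)
# Replace the trailing "Z" so the parse also works before Python 3.11.
parsed = datetime.fromisoformat("2024-01-15T12:00:00Z".replace("Z", "+00:00"))
# Interpret the UNIX timestamp explicitly as UTC rather than local time.
from_epoch = datetime.fromtimestamp(1705312800, tz=timezone.utc)

for value in (naive, aware, parsed, from_epoch):
    print(value.isoformat())
```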
SDK Type Reconstruction: When you retrieve a Document via the Python SDK, datetime/date/decimal values in metadata are automatically reconstructed to their Python types using the metadata_types hints. This means you get back what you put in:
from datetime import datetime
# Ingest with naive datetime
doc = db.ingest_text("...", metadata={"created": datetime(2024, 1, 15)})
# Retrieve - metadata["created"] is a datetime object, not a string
retrieved = db.get_document(doc.external_id)
print(type(retrieved.metadata["created"])) # <class 'datetime.datetime'>
print(retrieved.metadata["created"].tzinfo) # None (still naive)
Morphik handles mixed formats correctly - filtering and comparisons work even if some documents have naive datetimes and others have timezone-aware ones:
from datetime import datetime, UTC
# Mixed formats across documents - Morphik handles this fine
db.ingest_text("Doc A", metadata={"ts": datetime(2024, 1, 15)}) # naive
db.ingest_text("Doc B", metadata={"ts": datetime(2024, 6, 15, tzinfo=UTC)}) # aware
# Filtering works correctly
results = db.list_documents(filters={"ts": {"$gte": "2024-05-01"}}) # Returns Doc B
Python comparisons fail with mixed formats. If you retrieve mixed-format datetimes and compare them locally, Python raises TypeError:
sorted([naive_dt, aware_dt])  # TypeError: can't compare offset-naive and offset-aware datetimes
Recommendation: Stay consistent - pick one format (preferably timezone-aware with UTC) and use it throughout. Let Morphik handle filtering rather than sorting in Python.
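If mixed values do end up on the client, one workaround is to normalize them before comparing. The sketch below uses a hypothetical `to_utc` helper and assumes that naive datetimes were meant as UTC; that assumption is yours to verify, since a naive value carries no timezone information.

```python
from datetime import datetime, timezone


def to_utc(dt):
    """Return an aware UTC datetime; naive inputs are ASSUMED to be UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


naive_dt = datetime(2024, 1, 15)
aware_dt = datetime(2024, 6, 15, tzinfo=timezone.utc)

# Sorting the raw values fails because naive and aware cannot be ordered.
try:
    sorted([aware_dt, naive_dt])
    mixed_sort_failed = False
except TypeError:
    mixed_sort_failed = True

# Sorting on the normalized key works and returns the original objects.
ordered = sorted([aware_dt, naive_dt], key=to_utc)
```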
Implicit vs Explicit Syntax
- Implicit equality – Bare key/value pairs ({"status": "active"}) use JSON containment and are ideal for simple matching. They also check whether an array contains the value.
- Explicit operators – Wrap a field in an operator object to unlock typed comparisons, set logic, regex, substring checks, etc. ({"status": {"$ne": "archived"}}).
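To see how the two forms relate, the following sketch rewrites bare values into explicit `$eq` clauses. `to_explicit` is a hypothetical helper, and it assumes the implicit and `$eq` forms are interchangeable for scalar values, which the operator reference suggests but this page does not state outright.

```python
def to_explicit(filters):
    """Hypothetical helper: rewrite bare values as {"$eq": value}."""
    explicit = {}
    for key, value in filters.items():
        if key in ("$and", "$or", "$nor"):
            # Logical operators hold a list of nested clauses.
            explicit[key] = [to_explicit(clause) for clause in value]
        elif key == "$not":
            explicit[key] = to_explicit(value)
        elif isinstance(value, dict) and value and all(k.startswith("$") for k in value):
            # Already an operator object; leave it untouched.
            explicit[key] = value
        else:
            explicit[key] = {"$eq": value}
    return explicit


converted = to_explicit({"status": "active", "priority": {"$gte": 40}})
```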
Operator Reference
Equality & Comparison
| Operator | Description | Example |
|---|---|---|
| $eq / implicit value | Equality (also matches scalars in arrays). | {"status": {"$eq": "completed"}} |
| $ne | Not equal. | {"status": {"$ne": "archived"}} |
| $gt, $gte, $lt, $lte | Greater/less-than comparisons for numbers, decimals, dates, and datetimes (strings support $eq/$ne only). Requires correct metadata_types. | {"priority": {"$gte": 40}}, {"end_date": {"$lt": "2025-01-01"}} |
Set Membership
| Operator | Description | Example |
|---|---|---|
| $in | Matches any operand in the provided list. | {"status": {"$in": ["completed", "processing"]}} |
| $nin | Matches when the value is not in the list. | {"region": {"$nin": ["EU", "LATAM"]}} |
Type & Existence
| Operator | Description | Example |
|---|---|---|
| $exists | Field must (or must not) exist. Accepts booleans or truthy strings. | {"external_id": {"$exists": true}} |
| $type | Field must have one of the supported metadata types (string, number, decimal, datetime, date, boolean, array, object, null). | {"start_date": {"$type": "datetime"}} |
String & Pattern Matching
| Operator | Description | Example |
|---|---|---|
| $contains | Case-insensitive substring match by default; accepts { "value": "...", "case_sensitive": bool }. Works on scalars and array entries. | {"title": {"$contains": "Q4 Summary"}} |
| $regex | PostgreSQL regex match. Accepts a raw string pattern or { "pattern": "...", "flags": "i" } (only the i flag is supported). Works on scalars and arrays. | {"folder": {"$regex": {"pattern": "^fin", "flags": "i"}}} |
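For unit tests that should not hit the server, the `$contains` behavior described in the table can be approximated locally. This is a rough sketch, not Morphik's implementation: `matches_contains` is a hypothetical name, and it only models the documented defaults (case-insensitive, scalars or array entries).

```python
def matches_contains(field_value, operand):
    """Approximate $contains: substring match on a scalar or on array entries."""
    if isinstance(operand, dict):
        needle = operand["value"]
        case_sensitive = operand.get("case_sensitive", False)
    else:
        # A bare string operand uses the case-insensitive default.
        needle, case_sensitive = operand, False
    candidates = field_value if isinstance(field_value, list) else [field_value]
    for candidate in candidates:
        text, target = str(candidate), str(needle)
        if not case_sensitive:
            text, target = text.lower(), target.lower()
        if target in text:
            return True
    return False


title_hit = matches_contains("Q4 Summary 2024", "q4 summary")
tag_hit = matches_contains(["Contract", "nda"], {"value": "contract"})
```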
Logical Composition
| Operator | Description |
|---|---|
| $and | All nested clauses must match (non-empty list). |
| $or | At least one nested clause must match. |
| $nor | None of the nested clauses may match (NOT (A OR B)). |
| $not | Inverts a single clause. |
Mix logical operators freely with field-level operators for complex expressions.
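To reason about how nested logical operators combine, it can help to evaluate a filter against a plain dict. The evaluator below is a simplified local sketch covering only the logical, comparison, and set operators. It ignores array semantics and `metadata_types`, treats a missing field as a non-match, and is not how the server evaluates filters.

```python
import operator

COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(metadata, filters):
    """Evaluate a small subset of the filter syntax against one metadata dict."""
    for key, condition in filters.items():
        if key == "$and":
            ok = all(matches(metadata, clause) for clause in condition)
        elif key == "$or":
            ok = any(matches(metadata, clause) for clause in condition)
        elif key == "$nor":
            ok = not any(matches(metadata, clause) for clause in condition)
        elif key == "$not":
            ok = not matches(metadata, condition)
        else:
            ok = _field_matches(metadata, key, condition)
        if not ok:
            return False
    return True


def _field_matches(metadata, field, condition):
    # A bare value is treated as implicit equality.
    if not isinstance(condition, dict):
        condition = {"$eq": condition}
    if field not in metadata:
        return False  # simplification: missing fields never match
    value = metadata[field]
    for op, operand in condition.items():
        if op == "$in":
            ok = value in operand
        elif op == "$nin":
            ok = value not in operand
        else:
            ok = COMPARATORS[op](value, operand)
        if not ok:
            return False
    return True


doc = {"department": "research", "priority": 42, "status": "completed"}
both = matches(doc, {"$and": [{"department": "research"}, {"priority": {"$gte": 40}}]})
neither = matches(doc, {"$nor": [{"status": "archived"}, {"priority": {"$lt": 10}}]})
```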
Common Patterns
Current Window Between Start/End
{
  "$and": [
    {"start_date": {"$lte": "2024-06-01T00:00:00Z"}},
    {"end_date": {"$gte": "2024-06-01T00:00:00Z"}}
  ]
}
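The window filter above can be generated for any instant instead of hard-coding the timestamp. The sketch below uses a hypothetical `active_at` helper. The field names default to the ones in this example, and the helper rejects naive datetimes to avoid an implicit local-time conversion.

```python
from datetime import datetime, timezone


def active_at(instant, start_field="start_date", end_field="end_date"):
    """Build a filter matching documents whose window contains `instant`."""
    if instant.tzinfo is None:
        raise ValueError("instant must be timezone-aware")
    stamp = instant.astimezone(timezone.utc).isoformat()
    return {
        "$and": [
            {start_field: {"$lte": stamp}},
            {end_field: {"$gte": stamp}},
        ]
    }


window_filter = active_at(datetime(2024, 6, 1, tzinfo=timezone.utc))
```

The generated timestamps use the `+00:00` suffix rather than `Z`; per the timezone table earlier in this guide, the two forms are stored identically.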
# The same filter syntax works inside a folder and end-user scope.
folder = db.get_folder("legal")
scoped = folder.signin("user-42")
filters = {"priority": {"$gte": 50}}
response = scoped.list_documents(filters=filters, include_total_count=True)
Array Membership & Substring
{
  "$and": [
    {"tags": {"$contains": {"value": "contract"}}},
    {"tags": {"$regex": {"pattern": "quarter", "flags": "i"}}}
  ]
}
Troubleshooting
- “Unsupported metadata filter operator …” – Double-check spelling and operand type (lists for $in, non-empty arrays for $and, etc.).
- “Metadata field … expects type …” – The server couldn’t coerce the operand to the declared type. Ensure numbers/dates are valid JSON scalars or native Python types before serialization.
- Range query returns nothing – Confirm the target documents were ingested/updated with the corresponding metadata_types. Re-ingest or call update_document_metadata with the proper type hints if necessary.
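Several of these errors can be caught before a request is sent. The checker below is a hypothetical client-side sketch (`validate_filter` is not an SDK function) built from the operator names in this guide. It assumes `$or` and `$nor` need non-empty lists just as `$and` does, and it does not validate operand types against `metadata_types`.

```python
FIELD_OPERATORS = {
    "$eq", "$ne", "$gt", "$gte", "$lt", "$lte",
    "$in", "$nin", "$exists", "$type", "$contains", "$regex",
}
LIST_LOGICAL = {"$and", "$or", "$nor"}


def validate_filter(filters, path="filters"):
    """Return a list of problems; an empty list means none were detected."""
    if not isinstance(filters, dict):
        return [f"{path}: expected an object"]
    problems = []
    for key, value in filters.items():
        here = f"{path}.{key}"
        if key in LIST_LOGICAL:
            if not isinstance(value, list) or not value:
                problems.append(f"{here}: expected a non-empty list of clauses")
            else:
                for i, clause in enumerate(value):
                    problems += validate_filter(clause, f"{here}[{i}]")
        elif key == "$not":
            problems += validate_filter(value, here)
        elif key.startswith("$"):
            problems.append(f"{here}: unknown top-level operator")
        elif isinstance(value, dict):
            for op, operand in value.items():
                if op.startswith("$") and op not in FIELD_OPERATORS:
                    problems.append(f"{here}.{op}: unsupported operator")
                elif op in ("$in", "$nin") and not isinstance(operand, list):
                    problems.append(f"{here}.{op}: expected a list")
    return problems


typo_report = validate_filter({"status": {"$inn": ["completed"]}})
```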
Still stuck? Share your filter payload and endpoint with the Morphik team by email or on Discord.