Compare commits

...

24 Commits

Author SHA1 Message Date
perfectra1n
bd25ae77fc docs(search): rewrite benchmark doc for clarity
Consolidated from 12 sections to 4. Leads with the e2e results a
reviewer cares about, follows with scaling data, then lists what
changed and known limitations. Removed redundant tables and
internal-only details.
2026-03-21 14:06:13 -07:00
perfectra1n
9aec8be1c0 docs(search): add full search + fuzzy benchmark sections
Adds end-to-end full search (fastSearch=false) comparison tables
for both fuzzy ON and OFF, plus long queries and realistic typo
recovery benchmarks. Full search multi-token shows 45-65% improvement.
2026-03-21 14:06:10 -07:00
perfectra1n
90ac727250 docs(search): update benchmark comparison with final optimized numbers
All numbers re-measured on the same machine/session after the scoring,
highlighting, and tree walk optimizations. Multi-token autocomplete
now shows 50-70% improvement over main.
2026-03-21 14:06:06 -07:00
perfectra1n
5bc9840825 fix(search): restore toLowerCase in fuzzyMatchWordWithResult
The function has multiple callers (not just smartMatch) so it must
normalize inputs itself. Removing toLowerCase broke fuzzy matching
for the two-phase search path.
2026-03-21 14:06:03 -07:00
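The contract described in this commit can be illustrated with a small standalone sketch (names `norm` and `fuzzyMatchWord` are simplified stand-ins, not the project's actual implementations): a function with multiple callers cannot assume its inputs are pre-normalized, so it normalizes defensively.

```typescript
// Illustrative sketch: a matcher called from several code paths must
// normalize its own inputs, because not every caller pre-normalizes.
const norm = (s: string) => s.toLowerCase();

function fuzzyMatchWord(word: string, candidate: string): boolean {
    // Normalize here rather than relying on callers to have done it.
    return norm(candidate).includes(norm(word));
}

// Caller A passes pre-normalized input, caller B passes raw user input;
// both work because normalization happens inside the function.
fuzzyMatchWord("test", "testing");
fuzzyMatchWord("Test", "TESTING");
```

Removing the internal normalization would silently break only the callers that pass raw input, which matches the described two-phase search regression.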
perfectra1n
48dd93b94b revert: remove FTS5 content search (no measured end-to-end improvement)
FTS5 query was 32x faster in isolation, but the content scan is only
1-7% of total search time. The JS pipeline (scoring, snippets,
highlighting, tree walk) dominates. The in-memory optimizations in
this PR provide the real gains.

Removes: migration, fts_index service, event wiring, UI option,
integration test. Keeps all in-memory performance optimizations.
2026-03-20 12:50:40 -07:00
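The reasoning in this revert is Amdahl's law, and it can be sanity-checked with a quick calculation (the function name is invented for illustration; the fractions and the 32x figure come from the commit message):

```typescript
// Amdahl's law: speeding up a fraction p of total time by factor s
// yields an overall speedup of 1 / ((1 - p) + p / s).
function overallSpeedup(optimizedFraction: number, localSpeedup: number): number {
    return 1 / ((1 - optimizedFraction) + optimizedFraction / localSpeedup);
}

// Content scan is at most ~7% of total search time; FTS5 made it 32x
// faster in isolation.
const best = overallSpeedup(0.07, 32); // roughly 1.07 → at most ~7% end-to-end
```

With the scan at only 1% of total time, the ceiling drops below 1%, which is why the isolated 32x win did not move the end-to-end numbers.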
perfectra1n
ac231374f6 perf(search): optimize scoring, highlighting, and tree walk
- Remove redundant toLowerCase() before normalizeSearchText() in
  search_result.ts (normalizeSearchText already lowercases)
- Pre-normalize tokens once in addScoreForStrings instead of per-chunk
- Skip edit distance computation entirely when fuzzy matching is
disabled
- Move removeDiacritic() outside the regex while-loop in highlighting
- Cache normalized parent titles per search execution in
note_flat_text.ts
- Use Set for token lookup in searchPathTowardsRoot (O(1) vs O(n))
- Remove redundant toLowerCase in fuzzyMatchWordWithResult (inputs
  from smartMatch are already normalized)
2026-03-20 12:12:08 -07:00
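The "pre-normalize tokens once" item above can be sketched in isolation (helper and function names are invented for illustration; the real change lives in `addScoreForStrings` in search_result.ts): the normalization is loop-invariant with respect to chunks, so it is hoisted out of the inner loop.

```typescript
// Simplified stand-in for the project's normalizeSearchText:
// lowercase + strip diacritics.
const normalizeSearchText = (s: string) =>
    s.toLowerCase().normalize("NFD").replace(/[\u0300-\u036f]/g, "");

function countTokenHits(chunks: string[], tokens: string[]): number {
    // Hoisted: each token is normalized once, not once per (chunk, token) pair.
    const normalizedTokens = tokens.map(normalizeSearchText);
    let hits = 0;
    for (const chunk of chunks) {
        for (const token of normalizedTokens) {
            if (chunk.includes(token)) hits++;
        }
    }
    return hits;
}
```

For C chunks and T tokens this turns C×T normalize calls into T, which matters in a scoring loop that runs for every search result.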
perfectra1n
87fc4e1281 docs(search): add FTS5 benchmark results to performance comparison
Adds real SQLite benchmarks showing FTS5 is 15-33x faster for the
raw content query, though end-to-end improvement is masked by JS
pipeline overhead (scoring, snippets, path walking).
2026-03-20 12:05:08 -07:00
perfectra1n
8fd2cb39c1 fix(search): fix busy connection error in FTS5 index build
Collect rows before inserting — iterateRows() holds an open cursor
that conflicts with writes on the same connection.
2026-03-20 12:05:03 -07:00
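The collect-before-insert pattern from this fix can be sketched as follows (the `Db` interface is invented for illustration; the real code uses the project's sql module): materialize the rows first, then write, so no read cursor is open during the inserts.

```typescript
// Minimal shape of the database API assumed by this sketch.
interface Db {
    iterateRows<T>(query: string): Iterable<T>;
    execute(query: string, params: unknown[]): void;
}

function rebuildFtsIndex(db: Db): number {
    // Materialize all rows up front: an open read cursor on the same
    // connection would otherwise conflict with the INSERTs below
    // ("busy connection" error).
    const rows = [...db.iterateRows<{ noteId: string; content: string }>(
        "SELECT noteId, content FROM notes JOIN blobs USING (blobId)"
    )];
    for (const row of rows) {
        db.execute("INSERT INTO fts_index (noteId, content) VALUES (?, ?)",
            [row.noteId, row.content]);
    }
    return rows.length;
}
```

The trade-off is peak memory proportional to the result set, acceptable for a one-time index build.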
perfectra1n
24a01aefe2 feat(search): add user option to enable/disable FTS5 content index 2026-03-20 11:54:42 -07:00
perfectra1n
06fb9c0a6b test(search): add FTS5 integration test 2026-03-20 11:54:39 -07:00
perfectra1n
bc0942180e feat(search): use FTS5 index in NoteContentFulltextExp with sequential fallback
For operators =, !=, and *=*, the search now tries the FTS5 index first
via searchViaFts(). If FTS is unavailable or fails, it falls back to the
original sequential scan. The flat text attribute search is extracted
into its own searchFlatTextAttributes() method and runs after both
paths.
2026-03-20 11:54:36 -07:00
perfectra1n
f358563c27 feat(search): wire FTS index updates to note content changes 2026-03-20 11:54:24 -07:00
perfectra1n
dcaebeea83 feat(search): add FTS5 index service for content search 2026-03-20 11:54:21 -07:00
perfectra1n
ac13af73c5 feat(search): add FTS5 migration for content search index 2026-03-20 11:54:18 -07:00
perfectra1n
ba529d2721 feat(tests): implement search benchmark test... 2026-03-20 11:26:19 -07:00
perfectra1n
f23a7b4842 feat(settings): also allow for fuzzy searching to just be disabled 2026-03-18 11:43:28 -07:00
perfectra1n
5718631889 fix(search): resolve issue with autocomplete with search performance enhancements 2026-03-18 09:46:24 -07:00
Jon Fuller
da3d71d21e Merge branch 'main' into feat/search-perf-take1 2026-03-12 14:57:03 -07:00
perfectra1n
b533546236 fix(search): fix flying bracket 2026-03-12 14:35:47 -07:00
perfectra1n
1c148f407c feat(search): don't toss the entire index after each note change 2026-03-12 14:35:17 -07:00
perfectra1n
9403efa9a1 feat(search): add even some more robust tests 2026-03-12 14:21:36 -07:00
perfectra1n
6a06fc7995 feat(search): get rid of candidate capping 2026-03-12 14:02:23 -07:00
perfectra1n
77733ce205 feat(search): try to rice performance some more 2026-03-11 21:11:55 -07:00
perfectra1n
585b6ccd3e feat(search): try to improve performance 2026-03-11 19:05:44 -07:00
18 changed files with 2017 additions and 108 deletions

View File

@@ -1292,6 +1292,10 @@
"erase_excess_revision_snapshots": "Erase excess revision snapshots now",
"erase_excess_revision_snapshots_prompt": "Excess revision snapshots have been erased."
},
"search": {
"title": "Search",
"enable_fuzzy_matching": "Enable fuzzy matching in search (matches similar words when exact matches are insufficient)"
},
"search_engine": {
"title": "Search Engine",
"custom_search_engine_info": "Custom search engine requires both a name and a URL to be set. If either of these is not set, DuckDuckGo will be used as the default search engine.",

View File

@@ -21,6 +21,7 @@ import TimeSelector from "./components/TimeSelector";
export default function OtherSettings() {
return (
<>
<SearchSettings />
{isElectron() && <>
<SearchEngineSettings />
<TrayOptionsSettings />
@@ -36,6 +37,21 @@ export default function OtherSettings() {
);
}
function SearchSettings() {
const [ fuzzyEnabled, setFuzzyEnabled ] = useTriliumOptionBool("searchEnableFuzzyMatching");
return (
<OptionsSection title={t("search.title")}>
<FormCheckbox
name="search-fuzzy-matching"
label={t("search.enable_fuzzy_matching")}
currentValue={fuzzyEnabled}
onChange={setFuzzyEnabled}
/>
</OptionsSection>
);
}
function SearchEngineSettings() {
const [ customSearchEngineName, setCustomSearchEngineName ] = useTriliumOption("customSearchEngineName");
const [ customSearchEngineUrl, setCustomSearchEngineUrl ] = useTriliumOption("customSearchEngineUrl");

View File

@@ -0,0 +1,306 @@
/**
* Integration-level search profiling test.
*
* Uses the real SQLite database (spec/db/document.db loaded in-memory),
* real sql module, real becca cache, and the full app stack.
*
* Profiles search at large scale (50K+ notes) to match real-world
* performance reports from users with 240K+ notes.
*/
import { Application } from "express";
import { beforeAll, describe, expect, it } from "vitest";
import config from "../src/services/config.js";
let app: Application;
function timed<T>(fn: () => T): [T, number] {
const start = performance.now();
const result = fn();
return [result, performance.now() - start];
}
function randomId(len = 12): string {
const chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
let id = "";
for (let i = 0; i < len; i++) id += chars[Math.floor(Math.random() * chars.length)];
return id;
}
function randomWord(len = 8): string {
const chars = "abcdefghijklmnopqrstuvwxyz";
let w = "";
for (let i = 0; i < len; i++) w += chars[Math.floor(Math.random() * chars.length)];
return w;
}
function generateContent(wordCount: number, keyword?: string): string {
const paragraphs: string[] = [];
let remaining = wordCount;
let injected = false;
while (remaining > 0) {
const n = Math.min(remaining, 30 + Math.floor(Math.random() * 30));
const words: string[] = [];
for (let i = 0; i < n; i++) words.push(randomWord(3 + Math.floor(Math.random() * 10)));
if (keyword && !injected && remaining < wordCount / 2) {
words[Math.floor(words.length / 2)] = keyword;
injected = true;
}
paragraphs.push(`<p>${words.join(" ")}</p>`);
remaining -= n;
}
return paragraphs.join("\n");
}
describe("Search profiling (integration)", () => {
beforeAll(async () => {
config.General.noAuthentication = true;
const buildApp = (await import("../src/app.js")).default;
app = await buildApp();
});
it("large-scale profiling (50K notes)", async () => {
const sql = (await import("../src/services/sql.js")).default;
const becca = (await import("../src/becca/becca.js")).default;
const beccaLoader = (await import("../src/becca/becca_loader.js")).default;
const cls = (await import("../src/services/cls.js")).default;
const searchService = (await import("../src/services/search/services/search.js")).default;
const SearchContext = (await import("../src/services/search/search_context.js")).default;
const beccaService = (await import("../src/becca/becca_service.js")).default;
await new Promise<void>((resolve) => {
cls.init(() => {
const initialNoteCount = Object.keys(becca.notes).length;
console.log(`\n Initial becca notes: ${initialNoteCount}`);
// ── Seed 50K notes with hierarchy ──
// Some folders (depth), some with common keyword "test" in title
const TOTAL_NOTES = 50000;
const FOLDER_COUNT = 500; // 500 folders
const NOTES_PER_FOLDER = (TOTAL_NOTES - FOLDER_COUNT) / FOLDER_COUNT; // ~99 notes per folder
const MATCH_FRACTION = 0.10; // 10% match "test" — ~5000 notes
const CONTENT_WORDS = 500;
const now = new Date().toISOString().replace("T", " ").replace("Z", "+0000");
console.log(` Seeding ${TOTAL_NOTES} notes (${FOLDER_COUNT} folders, ~${NOTES_PER_FOLDER.toFixed(0)} per folder)...`);
const [, seedMs] = timed(() => {
sql.transactional(() => {
const folderIds: string[] = [];
// Create folders under root
for (let f = 0; f < FOLDER_COUNT; f++) {
const noteId = `seed${randomId(8)}`;
const branchId = `seed${randomId(8)}`;
const blobId = `seed${randomId(16)}`;
folderIds.push(noteId);
sql.execute(
`INSERT INTO blobs (blobId, content, dateModified, utcDateModified) VALUES (?, ?, ?, ?)`,
[blobId, `<p>Folder ${f}</p>`, now, now]
);
sql.execute(
`INSERT INTO notes (noteId, title, type, mime, blobId, isProtected, isDeleted,
dateCreated, dateModified, utcDateCreated, utcDateModified)
VALUES (?, ?, 'text', 'text/html', ?, 0, 0, ?, ?, ?, ?)`,
[noteId, `Folder ${f} ${randomWord(5)}`, blobId, now, now, now, now]
);
sql.execute(
`INSERT INTO branches (branchId, noteId, parentNoteId, notePosition, isDeleted, isExpanded, utcDateModified)
VALUES (?, ?, 'root', ?, 0, 0, ?)`,
[branchId, noteId, f * 10, now]
);
}
// Create notes under folders
let noteIdx = 0;
for (let f = 0; f < FOLDER_COUNT; f++) {
const parentId = folderIds[f];
for (let n = 0; n < NOTES_PER_FOLDER; n++) {
const isMatch = noteIdx < TOTAL_NOTES * MATCH_FRACTION;
const noteId = `seed${randomId(8)}`;
const branchId = `seed${randomId(8)}`;
const blobId = `seed${randomId(16)}`;
const title = isMatch
? `Test Document ${noteIdx} ${randomWord(6)}`
: `Note ${noteIdx} ${randomWord(6)} ${randomWord(5)}`;
const content = generateContent(CONTENT_WORDS, isMatch ? "test" : undefined);
sql.execute(
`INSERT INTO blobs (blobId, content, dateModified, utcDateModified) VALUES (?, ?, ?, ?)`,
[blobId, content, now, now]
);
sql.execute(
`INSERT INTO notes (noteId, title, type, mime, blobId, isProtected, isDeleted,
dateCreated, dateModified, utcDateCreated, utcDateModified)
VALUES (?, ?, 'text', 'text/html', ?, 0, 0, ?, ?, ?, ?)`,
[noteId, title, blobId, now, now, now, now]
);
sql.execute(
`INSERT INTO branches (branchId, noteId, parentNoteId, notePosition, isDeleted, isExpanded, utcDateModified)
VALUES (?, ?, ?, ?, 0, 0, ?)`,
[branchId, noteId, parentId, n * 10, now]
);
noteIdx++;
}
}
});
});
console.log(` SQL seeding: ${seedMs.toFixed(0)}ms`);
const [, reloadMs] = timed(() => beccaLoader.load());
const totalNotes = Object.keys(becca.notes).length;
console.log(` Becca reload: ${reloadMs.toFixed(0)}ms Total notes: ${totalNotes}`);
// ── Warm caches ──
searchService.searchNotesForAutocomplete("test", true);
// ════════════════════════════════════════════
// PROFILING AT SCALE
// ════════════════════════════════════════════
console.log(`\n ════ PROFILING (${totalNotes} notes) ════\n`);
// 1. getCandidateNotes cost (the full-scan bottleneck)
const allNotes = Object.values(becca.notes);
const [, flatScanMs] = timed(() => {
let count = 0;
for (const note of allNotes) {
const ft = note.getFlatText();
if (ft.includes("test")) count++;
}
return count;
});
console.log(` getFlatText + includes scan (${allNotes.length} notes): ${flatScanMs.toFixed(1)}ms`);
// 2. Full findResultsWithQuery (includes candidate scan + parent walk + scoring)
const findTimes: number[] = [];
let findResultCount = 0;
for (let i = 0; i < 3; i++) {
const [r, ms] = timed(() =>
searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: true }))
);
findTimes.push(ms);
findResultCount = r.length;
}
const findAvg = findTimes.reduce((a, b) => a + b, 0) / findTimes.length;
console.log(` findResultsWithQuery (fast): avg ${findAvg.toFixed(1)}ms (${findResultCount} results)`);
// 3. Exact-only (no fuzzy)
const exactTimes: number[] = [];
for (let i = 0; i < 3; i++) {
const [, ms] = timed(() =>
searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: true, enableFuzzyMatching: false }))
);
exactTimes.push(ms);
}
const exactAvg = exactTimes.reduce((a, b) => a + b, 0) / exactTimes.length;
console.log(` findResultsWithQuery (exact): avg ${exactAvg.toFixed(1)}ms`);
console.log(` Fuzzy overhead: ${(findAvg - exactAvg).toFixed(1)}ms`);
// 4. SearchResult construction + computeScore cost (isolated)
const results = searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: true }));
console.log(` Total results before trim: ${results.length}`);
const [, scoreAllMs] = timed(() => {
for (const r of results) r.computeScore("test", ["test"], true);
});
console.log(` computeScore × ${results.length}: ${scoreAllMs.toFixed(1)}ms (${(scoreAllMs / results.length).toFixed(3)}ms/result)`);
// 5. getNoteTitleForPath for all results
const [, pathTitleMs] = timed(() => {
for (const r of results) beccaService.getNoteTitleForPath(r.notePathArray);
});
console.log(` getNoteTitleForPath × ${results.length}: ${pathTitleMs.toFixed(1)}ms`);
// 6. Content snippet extraction (only 200)
const trimmed = results.slice(0, 200);
const [, snippetMs] = timed(() => {
for (const r of trimmed) {
r.contentSnippet = searchService.extractContentSnippet(r.noteId, ["test"]);
}
});
console.log(` extractContentSnippet × 200: ${snippetMs.toFixed(1)}ms`);
// 7. Highlighting (only 200)
const [, hlMs] = timed(() => {
searchService.highlightSearchResults(trimmed, ["test"]);
});
console.log(` highlightSearchResults × 200: ${hlMs.toFixed(1)}ms`);
// 7b. getBestNotePath cost (used by fast path)
const sampleNotes = Object.values(becca.notes).filter(n => n.title.startsWith("Test Document")).slice(0, 1000);
const [, bestPathMs] = timed(() => {
for (const n of sampleNotes) n.getBestNotePath();
});
console.log(` getBestNotePath × ${sampleNotes.length}: ${bestPathMs.toFixed(1)}ms (${(bestPathMs/sampleNotes.length).toFixed(3)}ms/note)`);
// 8. Full autocomplete end-to-end
const autoTimes: number[] = [];
let autoCount = 0;
for (let i = 0; i < 3; i++) {
const [r, ms] = timed(() =>
searchService.searchNotesForAutocomplete("test", true)
);
autoTimes.push(ms);
autoCount = r.length;
}
const autoAvg = autoTimes.reduce((a, b) => a + b, 0) / autoTimes.length;
const autoMin = Math.min(...autoTimes);
console.log(`\n ★ FULL AUTOCOMPLETE: avg ${autoAvg.toFixed(1)}ms min ${autoMin.toFixed(1)}ms (${autoCount} results)`);
// 9. With a less common search term (fewer matches)
const rareTimes: number[] = [];
let rareCount = 0;
for (let i = 0; i < 3; i++) {
const [r, ms] = timed(() =>
searchService.searchNotesForAutocomplete("leitfaden", true)
);
rareTimes.push(ms);
rareCount = r.length;
}
const rareAvg = rareTimes.reduce((a, b) => a + b, 0) / rareTimes.length;
console.log(` Autocomplete "leitfaden": avg ${rareAvg.toFixed(1)}ms (${rareCount} results)`);
// 10. Full search (fastSearch=false) — the 2.7s bottleneck
console.log(`\n ── Full search (fastSearch=false) ──`);
const fullTimes: number[] = [];
let fullCount = 0;
for (let i = 0; i < 2; i++) {
const [r, ms] = timed(() =>
searchService.findResultsWithQuery("test", new SearchContext({ fastSearch: false }))
);
fullTimes.push(ms);
fullCount = r.length;
}
const fullAvg = fullTimes.reduce((a, b) => a + b, 0) / fullTimes.length;
console.log(` Full search (flat + SQL): avg ${fullAvg.toFixed(1)}ms (${fullCount} results)`);
// 11. SQL content scan alone
const [scanCount, scanMs] = timed(() => {
let count = 0;
for (const row of sql.iterateRows<{ content: Buffer | string }>(`
SELECT noteId, type, mime, content, isProtected
FROM notes JOIN blobs USING (blobId)
WHERE type IN ('text', 'code', 'mermaid', 'canvas', 'mindMap')
AND isDeleted = 0
AND LENGTH(content) < 2097152`)) {
count++;
}
return count;
});
console.log(` Raw SQL scan (${scanCount} rows): ${scanMs.toFixed(1)}ms`);
// ── Summary ──
console.log(`\n ════ SUMMARY ════`);
console.log(` Notes: ${totalNotes} | Matches: ${findResultCount} | Hierarchy depth: 3 (root → folder → note)`);
console.log(` ──────────────────────────────────`);
console.log(` Autocomplete (fast): ${autoAvg.toFixed(1)}ms`);
console.log(` findResults: ${findAvg.toFixed(1)}ms (${((findAvg/autoAvg)*100).toFixed(0)}%)`);
console.log(` snippets+highlight: ${(snippetMs + hlMs).toFixed(1)}ms (${(((snippetMs+hlMs)/autoAvg)*100).toFixed(0)}%)`);
console.log(` Full search: ${fullAvg.toFixed(1)}ms`);
resolve();
});
});
}, 600_000);
});

View File

@@ -31,9 +31,22 @@ export default class Becca {
allNoteSetCache: NoteSet | null;
/**
* Pre-built parallel arrays for fast flat text scanning in search.
* Avoids per-note property access overhead when iterating 50K+ notes.
* Supports incremental updates: when individual notes change, only their
* entries are rebuilt rather than the entire index.
*/
flatTextIndex: { notes: BNote[], flatTexts: string[], noteIdToIdx: Map<string, number> } | null;
/** NoteIds whose flat text needs to be recomputed in the index. */
dirtyFlatTextNoteIds: Set<string>;
constructor() {
this.dirtyFlatTextNoteIds = new Set();
this.allNoteSetCache = null;
this.flatTextIndex = null;
this.reset();
}
reset() {
@@ -239,6 +252,59 @@ export default class Becca {
/** Should be called when the set of all non-skeleton notes changes (added/removed) */
dirtyNoteSetCache() {
this.allNoteSetCache = null;
// Full rebuild needed since the note set itself changed
this.flatTextIndex = null;
this.dirtyFlatTextNoteIds.clear();
}
/** Mark a single note's flat text as needing recomputation in the index. */
dirtyNoteFlatText(noteId: string) {
if (this.flatTextIndex) {
// Index exists — schedule an incremental update
this.dirtyFlatTextNoteIds.add(noteId);
}
// If flatTextIndex is null, full rebuild will happen on next access anyway
}
/**
* Returns pre-built parallel arrays of notes and their flat texts for fast scanning.
* The flat texts are already normalized (lowercase, diacritics removed).
* Supports incremental updates: when individual notes are dirtied, only their
* entries are recomputed rather than rebuilding the entire index.
*/
getFlatTextIndex(): { notes: BNote[], flatTexts: string[], noteIdToIdx: Map<string, number> } {
if (!this.flatTextIndex) {
const allNoteSet = this.getAllNoteSet();
const notes: BNote[] = [];
const flatTexts: string[] = [];
const noteIdToIdx = new Map<string, number>();
for (const note of allNoteSet.notes) {
noteIdToIdx.set(note.noteId, notes.length);
notes.push(note);
flatTexts.push(note.getFlatText());
}
this.flatTextIndex = { notes, flatTexts, noteIdToIdx };
this.dirtyFlatTextNoteIds.clear();
} else if (this.dirtyFlatTextNoteIds.size > 0) {
// Incremental update: only recompute flat texts for dirtied notes
const { flatTexts, noteIdToIdx } = this.flatTextIndex;
for (const noteId of this.dirtyFlatTextNoteIds) {
const idx = noteIdToIdx.get(noteId);
if (idx !== undefined) {
const note = this.notes[noteId];
if (note) {
flatTexts[idx] = note.getFlatText();
}
}
}
this.dirtyFlatTextNoteIds.clear();
}
return this.flatTextIndex;
}
getAllNoteSet() {

View File

@@ -6,6 +6,7 @@ import dateUtils from "../../services/date_utils.js";
import promotedAttributeDefinitionParser from "../../services/promoted_attribute_definition_parser.js";
import sanitizeAttributeName from "../../services/sanitize_attribute_name.js";
import type { AttributeRow, AttributeType } from "@triliumnext/commons";
import { normalize } from "../../services/utils.js";
interface SavingOpts {
skipValidation?: boolean;
@@ -34,6 +35,11 @@ class BAttribute extends AbstractBeccaEntity<BAttribute> {
value!: string;
isInheritable!: boolean;
/** Pre-normalized (lowercase, diacritics removed) name for search. */
normalizedName!: string;
/** Pre-normalized (lowercase, diacritics removed) value for search. */
normalizedValue!: string;
constructor(row?: AttributeRow) {
super();
@@ -59,6 +65,10 @@ class BAttribute extends AbstractBeccaEntity<BAttribute> {
this.isInheritable = !!isInheritable;
this.utcDateModified = utcDateModified;
// Pre-compute normalized forms for search (avoids repeated normalize() calls in hot loops)
this.normalizedName = normalize(this.name);
this.normalizedValue = normalize(this.value);
return this;
}

View File

@@ -790,6 +790,9 @@ class BNote extends AbstractBeccaEntity<BNote> {
this.__attributeCache = null;
this.__inheritableAttributeCache = null;
this.__ancestorCache = null;
// Mark only this note's flat text as dirty for incremental index update
this.becca.dirtyNoteFlatText(this.noteId);
}
invalidateSubTree(path: string[] = []) {

View File

@@ -97,6 +97,7 @@ const ALLOWED_OPTIONS = new Set<OptionNames>([
"layoutOrientation",
"backgroundEffects",
"allowedHtmlTags",
"searchEnableFuzzyMatching",
"redirectBareDomain",
"showLoginInShareTheme",
"splitEditorOrientation",

View File

@@ -198,6 +198,9 @@ const defaultOptions: DefaultOption[] = [
isSynced: true
},
// Search settings
{ name: "searchEnableFuzzyMatching", value: "true", isSynced: true },
// Share settings
{ name: "redirectBareDomain", value: "false", isSynced: true },
{ name: "showLoginInShareTheme", value: "false", isSynced: true },

View File

@@ -7,7 +7,7 @@ import Expression from "./expression.js";
import NoteSet from "../note_set.js";
import becca from "../../../becca/becca.js";
import { normalize } from "../../utils.js";
import { normalizeSearchText, fuzzyMatchWordWithResult } from "../utils/text_utils.js";
import beccaService from "../../../becca/becca_service.js";
class NoteFlatTextExp extends Expression {
@@ -23,6 +23,18 @@ class NoteFlatTextExp extends Expression {
execute(inputNoteSet: NoteSet, executionContext: any, searchContext: SearchContext) {
const resultNoteSet = new NoteSet();
// Cache normalized titles to avoid redundant normalize+getNoteTitle calls
const titleCache = new Map<string, string>();
const getNormalizedTitle = (noteId: string, parentNoteId: string): string => {
const key = `${noteId}-${parentNoteId}`;
let cached = titleCache.get(key);
if (cached === undefined) {
cached = normalizeSearchText(beccaService.getNoteTitle(noteId, parentNoteId));
titleCache.set(key, cached);
}
return cached;
};
/**
* @param note
* @param remainingTokens - tokens still needed to be found in the path towards root
@@ -38,10 +50,8 @@ class NoteFlatTextExp extends Expression {
const noteId = resultPath[resultPath.length - 1];
if (!resultNoteSet.hasNoteId(noteId)) {
// we could get here from multiple paths, the first one wins because the paths
// are sorted by importance
// Snapshot takenPath since it's mutable
executionContext.noteIdToNotePath[noteId] = resultPath;
resultNoteSet.add(becca.notes[noteId]);
}
}
@@ -50,45 +60,40 @@ class NoteFlatTextExp extends Expression {
}
if (note.parents.length === 0 || note.noteId === "root") {
// we've reached root, but there are still remaining tokens -> this candidate note produced no result
return;
}
const foundAttrTokens: string[] = [];
for (const token of remainingTokens) {
if ((note.type && note.type.includes(token)) ||
(note.mime && note.mime.includes(token))) {
foundAttrTokens.push(token);
}
}
for (const attribute of note.getOwnedAttributes()) {
for (const token of remainingTokens) {
if (attribute.normalizedName.includes(token) || attribute.normalizedValue.includes(token)) {
foundAttrTokens.push(token);
}
}
}
for (const parentNote of note.parents) {
const title = getNormalizedTitle(note.noteId, parentNote.noteId);
// Use Set for O(1) lookup instead of Array.includes() which is O(n)
const foundTokenSet = new Set<string>(foundAttrTokens);
for (const token of remainingTokens) {
if (this.smartMatch(title, token, searchContext)) {
foundTokenSet.add(token);
}
}
if (foundTokenSet.size > 0) {
const newRemainingTokens = remainingTokens.filter((token) => !foundTokenSet.has(token));
searchPathTowardsRoot(parentNote, newRemainingTokens, [note.noteId, ...takenPath]);
} else {
@@ -99,6 +104,22 @@ class NoteFlatTextExp extends Expression {
const candidateNotes = this.getCandidateNotes(inputNoteSet, searchContext);
// Fast path for single-token autocomplete searches:
// Skip the expensive recursive parent walk and just use getBestNotePath().
// The flat text already matched, so we know the token is present.
if (this.tokens.length === 1 && searchContext.autocomplete) {
for (const note of candidateNotes) {
if (!resultNoteSet.hasNoteId(note.noteId)) {
const notePath = note.getBestNotePath();
if (notePath) {
executionContext.noteIdToNotePath[note.noteId] = notePath;
resultNoteSet.add(note);
}
}
}
return resultNoteSet;
}
for (const note of candidateNotes) {
// autocomplete should be able to find notes by their noteIds as well (only leafs)
if (this.tokens.length === 1 && note.noteId.toLowerCase() === this.tokens[0]) {
@@ -112,13 +133,13 @@ class NoteFlatTextExp extends Expression {
// Add defensive checks for undefined properties
const typeMatches = note.type && note.type.includes(token);
const mimeMatches = note.mime && note.mime.includes(token);
if (typeMatches || mimeMatches) {
foundAttrTokens.push(token);
}
for (const attribute of note.ownedAttributes) {
if (attribute.normalizedName.includes(token) || attribute.normalizedValue.includes(token)) {
foundAttrTokens.push(token);
}
}
@@ -165,10 +186,25 @@ class NoteFlatTextExp extends Expression {
getCandidateNotes(noteSet: NoteSet, searchContext?: SearchContext): BNote[] {
const candidateNotes: BNote[] = [];
// Use the pre-built flat text index for fast scanning.
// This provides pre-computed flat texts in a parallel array, avoiding
// per-note property access overhead at large scale (50K+ notes).
const { notes: indexNotes, flatTexts } = becca.getFlatTextIndex();
// Build a set for quick membership check when noteSet isn't the full set
const isFullSet = noteSet.notes.length === indexNotes.length;
for (let i = 0; i < indexNotes.length; i++) {
const note = indexNotes[i];
// Skip notes not in the input set (only check when not using the full set)
if (!isFullSet && !noteSet.hasNoteId(note.noteId)) {
continue;
}
const flatText = flatTexts[i];
for (const token of this.tokens) {
if (this.smartMatch(flatText, token, searchContext)) {
candidateNotes.push(note);
break;
}

View File

@@ -1,6 +1,7 @@
"use strict";
import hoistedNoteService from "../hoisted_note.js";
import optionService from "../options.js";
import type { SearchParams } from "./services/types.js";
class SearchContext {
@@ -18,6 +19,8 @@ class SearchContext {
debug?: boolean;
debugInfo: {} | null;
fuzzyAttributeSearch: boolean;
/** When true, skip the two-phase fuzzy fallback and use the single-token fast path. */
autocomplete: boolean;
enableFuzzyMatching: boolean; // Controls whether fuzzy matching is enabled for this search phase
highlightedTokens: string[];
originalQuery: string;
@@ -46,7 +49,12 @@ class SearchContext {
this.debug = params.debug;
this.debugInfo = null;
this.fuzzyAttributeSearch = !!params.fuzzyAttributeSearch;
this.autocomplete = !!params.autocomplete;
try {
this.enableFuzzyMatching = optionService.getOptionBool("searchEnableFuzzyMatching");
} catch {
this.enableFuzzyMatching = true; // Default to true if option not yet initialized
}
this.highlightedTokens = [];
this.originalQuery = "";
this.fulltextQuery = ""; // complete fulltext part

View File

@@ -59,8 +59,9 @@ class SearchResult {
this.fuzzyScore = 0; // Reset fuzzy score tracking
const note = becca.notes[this.noteId];
// normalizeSearchText already lowercases — no need for .toLowerCase() first
const normalizedQuery = normalizeSearchText(fulltextQuery);
const normalizedTitle = normalizeSearchText(note.title);
// Note ID exact match, much higher score
if (note.noteId.toLowerCase() === fulltextQuery) {
@@ -91,35 +92,37 @@ class SearchResult {
}
addScoreForStrings(tokens: string[], str: string, factor: number, enableFuzzyMatching: boolean = true) {
// normalizeSearchText already lowercases — no need for .toLowerCase() first
const normalizedStr = normalizeSearchText(str);
const chunks = normalizedStr.split(" ");
// Pre-normalize tokens once instead of per-chunk
const normalizedTokens = tokens.map(t => normalizeSearchText(t));
let tokenScore = 0;
for (const chunk of chunks) {
for (let ti = 0; ti < normalizedTokens.length; ti++) {
const normalizedToken = normalizedTokens[ti];
if (chunk === normalizedToken) {
tokenScore += SCORE_WEIGHTS.TOKEN_EXACT_MATCH * tokens[ti].length * factor;
} else if (chunk.startsWith(normalizedToken)) {
tokenScore += SCORE_WEIGHTS.TOKEN_PREFIX_MATCH * tokens[ti].length * factor;
} else if (chunk.includes(normalizedToken)) {
tokenScore += SCORE_WEIGHTS.TOKEN_CONTAINS_MATCH * tokens[ti].length * factor;
} else if (enableFuzzyMatching &&
normalizedToken.length >= FUZZY_SEARCH_CONFIG.MIN_FUZZY_TOKEN_LENGTH &&
this.fuzzyScore < SCORE_WEIGHTS.MAX_TOTAL_FUZZY_SCORE) {
// Only compute edit distance when fuzzy matching is enabled
const editDistance = calculateOptimizedEditDistance(chunk, normalizedToken, FUZZY_SEARCH_CONFIG.MAX_EDIT_DISTANCE);
if (editDistance <= FUZZY_SEARCH_CONFIG.MAX_EDIT_DISTANCE) {
const fuzzyWeight = SCORE_WEIGHTS.TOKEN_FUZZY_MATCH * (1 - editDistance / FUZZY_SEARCH_CONFIG.MAX_EDIT_DISTANCE);
// Apply caps: limit token length multiplier and per-token contribution
-const cappedTokenLength = Math.min(token.length, SCORE_WEIGHTS.MAX_FUZZY_TOKEN_LENGTH_MULTIPLIER);
+const cappedTokenLength = Math.min(tokens[ti].length, SCORE_WEIGHTS.MAX_FUZZY_TOKEN_LENGTH_MULTIPLIER);
const fuzzyTokenScore = Math.min(
fuzzyWeight * cappedTokenLength * factor,
SCORE_WEIGHTS.MAX_FUZZY_SCORE_PER_TOKEN
);
tokenScore += fuzzyTokenScore;
this.fuzzyScore += fuzzyTokenScore;
}
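Taken together, the two hoisting changes in this hunk turn O(chunks × tokens) normalization calls into O(tokens). A runnable sketch of the pattern, with a stand-in `normalize` function and made-up weight constants (the real code uses `normalizeSearchText` and `SCORE_WEIGHTS`):

```typescript
// Stand-in for normalizeSearchText: lowercase and strip diacritics.
const normalize = (s: string): string =>
    s.toLowerCase().normalize("NFD").replace(/[\u0300-\u036f]/g, "");

// Illustrative weights; the real values live in SCORE_WEIGHTS.
const EXACT_WEIGHT = 4;
const PREFIX_WEIGHT = 2;

function scoreTokens(tokens: string[], str: string): number {
    const chunks = normalize(str).split(" ");
    // Hoisted: each token is normalized once, not once per chunk.
    const normalizedTokens = tokens.map(normalize);
    let score = 0;
    for (const chunk of chunks) {
        for (let ti = 0; ti < normalizedTokens.length; ti++) {
            if (chunk === normalizedTokens[ti]) {
                score += EXACT_WEIGHT * tokens[ti].length;
            } else if (chunk.startsWith(normalizedTokens[ti])) {
                score += PREFIX_WEIGHT * tokens[ti].length;
            }
        }
    }
    return score;
}
```

With 50 chunks and 3 tokens this saves 147 normalize calls per scored string, which adds up across thousands of candidate notes.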


@@ -1,6 +1,5 @@
"use strict";
-import normalizeString from "normalize-strings";
import lex from "./lex.js";
import handleParens from "./handle_parens.js";
import parse from "./parse.js";
@@ -8,7 +7,7 @@ import SearchResult from "../search_result.js";
import SearchContext from "../search_context.js";
import becca from "../../../becca/becca.js";
import beccaService from "../../../becca/becca_service.js";
-import { normalize, escapeHtml, escapeRegExp } from "../../utils.js";
+import { normalize, removeDiacritic, escapeHtml, escapeRegExp } from "../../utils.js";
import log from "../../log.js";
import hoistedNoteService from "../../hoisted_note.js";
import type BNote from "../../../becca/entities/bnote.js";
@@ -17,7 +16,6 @@ import type { SearchParams, TokenStructure } from "./types.js";
import type Expression from "../expressions/expression.js";
import sql from "../../sql.js";
import scriptService from "../../script.js";
-import striptags from "striptags";
import protectedSessionService from "../../protected_session.js";
export interface SearchNoteResult {
@@ -250,23 +248,30 @@ function findResultsWithExpression(expression: Expression, searchContext: Search
return performSearch(expression, searchContext, false);
}
+// For autocomplete searches, skip the expensive two-phase fuzzy fallback.
+// The user is typing and will refine their query — exact matching is
+// sufficient and avoids a second full scan of all notes.
+if (searchContext.autocomplete) {
+return performSearch(expression, searchContext, false);
+}
// Phase 1: Try exact matches first (without fuzzy matching)
const exactResults = performSearch(expression, searchContext, false);
// Check if we have sufficient high-quality results
const minResultThreshold = 5;
const minScoreForQuality = 10; // Minimum score to consider a result "high quality"
const highQualityResults = exactResults.filter(result => result.score >= minScoreForQuality);
// If we have enough high-quality exact matches, return them
if (highQualityResults.length >= minResultThreshold) {
return exactResults;
}
// Phase 2: Add fuzzy matching as fallback when exact matches are insufficient
const fuzzyResults = performSearch(expression, searchContext, true);
// Merge results, ensuring exact matches always rank higher than fuzzy matches
return mergeExactAndFuzzyResults(exactResults, fuzzyResults);
}
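The control flow above reduces to two decisions: skip the fuzzy pass when enough high-quality exact hits exist, and otherwise merge so that exact hits always outrank fuzzy-only ones. A minimal sketch of just that gating logic (thresholds copied from above; the real `mergeExactAndFuzzyResults` may dedupe and rank differently):

```typescript
interface Result {
    noteId: string;
    score: number;
}

const MIN_RESULT_THRESHOLD = 5;
const MIN_SCORE_FOR_QUALITY = 10;

function twoPhaseSearch(
    exactSearch: () => Result[],
    fuzzySearch: () => Result[]
): Result[] {
    // Phase 1: exact matching only.
    const exact = exactSearch();
    const highQuality = exact.filter(r => r.score >= MIN_SCORE_FOR_QUALITY);
    if (highQuality.length >= MIN_RESULT_THRESHOLD) {
        return exact; // enough good hits, so skip the second full scan
    }
    // Phase 2: fuzzy fallback; exact results always come first.
    const fuzzy = fuzzySearch();
    const seen = new Set(exact.map(r => r.noteId));
    return [...exact, ...fuzzy.filter(r => !seen.has(r.noteId))];
}
```

The important property is that the fuzzy search closure is never invoked when phase 1 already produced enough quality results, so the common case pays for only one scan.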
@@ -448,7 +453,7 @@ function extractContentSnippet(noteId: string, searchTokens: string[], maxLength
try {
let content = note.getContent();
if (!content || typeof content !== "string") {
return "";
}
@@ -464,77 +469,66 @@ function extractContentSnippet(noteId: string, searchTokens: string[], maxLength
return ""; // Protected but no session available
}
-// Strip HTML tags for text notes
+// Strip HTML tags for text notes — use fast regex for snippet extraction
+// (striptags library is ~18x slower and not needed for search snippets)
if (note.type === "text") {
-content = striptags(content);
+content = content.replace(/<[^>]*>/g, "");
}
-// Normalize whitespace while preserving paragraph breaks
-// First, normalize multiple newlines to double newlines (paragraph breaks)
-content = content.replace(/\n\s*\n/g, "\n\n");
-// Then normalize spaces within lines
-content = content.split('\n').map(line => line.replace(/\s+/g, " ").trim()).join('\n');
-// Finally trim the whole content
content = content.trim();
if (!content) {
return "";
}
-// Try to find a snippet around the first matching token
-const normalizedContent = normalizeString(content.toLowerCase());
+// Find match position using normalize on the raw stripped content.
+// We use a single normalize() pass — no need for expensive whitespace
+// normalization just to find the match index.
+const normalizedContent = normalize(content);
+const normalizedTokens = searchTokens.map(token => normalize(token));
let snippetStart = 0;
let matchFound = false;
-for (const token of searchTokens) {
-const normalizedToken = normalizeString(token.toLowerCase());
+for (const normalizedToken of normalizedTokens) {
const matchIndex = normalizedContent.indexOf(normalizedToken);
if (matchIndex !== -1) {
// Center the snippet around the match
snippetStart = Math.max(0, matchIndex - maxLength / 2);
matchFound = true;
break;
}
}
-// Extract snippet
-let snippet = content.substring(snippetStart, snippetStart + maxLength);
+// Extract a snippet region from the raw content, then clean only that
+const snippetRegion = content.substring(snippetStart, snippetStart + maxLength + 100);
-// If snippet contains linebreaks, limit to max 4 lines and override character limit
+// Normalize whitespace only on the small snippet region
+let snippet = snippetRegion
+.replace(/\n\s*\n/g, "\n\n")
+.replace(/[ \t]+/g, " ")
+.trim()
+.substring(0, maxLength);
+// If snippet contains linebreaks, limit to max 4 lines
const lines = snippet.split('\n');
if (lines.length > 4) {
-// Find which lines contain the search tokens to ensure they're included
-const normalizedLines = lines.map(line => normalizeString(line.toLowerCase()));
-const normalizedTokens = searchTokens.map(token => normalizeString(token.toLowerCase()));
// Find the first line that contains a search token
let firstMatchLine = -1;
-for (let i = 0; i < normalizedLines.length; i++) {
-if (normalizedTokens.some(token => normalizedLines[i].includes(token))) {
+for (let i = 0; i < lines.length; i++) {
+const normalizedLine = normalize(lines[i]);
+if (normalizedTokens.some(token => normalizedLine.includes(token))) {
firstMatchLine = i;
break;
}
}
if (firstMatchLine !== -1) {
// Center the 4-line window around the first match
// Try to show 1 line before and 2 lines after the match
const startLine = Math.max(0, firstMatchLine - 1);
const endLine = Math.min(lines.length, startLine + 4);
snippet = lines.slice(startLine, endLine).join('\n');
} else {
// No match found in lines (shouldn't happen), just take first 4
snippet = lines.slice(0, 4).join('\n');
}
// Add ellipsis if we truncated lines
snippet = snippet + "...";
-} else if (lines.length > 1) {
-// For multi-line snippets that are 4 or fewer lines, keep them as-is
-// No need to truncate
-} else {
-// Single line content - apply original word boundary logic
-// Try to start/end at word boundaries
+} else if (lines.length <= 1) {
+// Single line content - apply word boundary logic
if (snippetStart > 0) {
const firstSpace = snippet.search(/\s/);
if (firstSpace > 0 && firstSpace < 20) {
@@ -542,7 +536,7 @@ function extractContentSnippet(noteId: string, searchTokens: string[], maxLength
}
snippet = "..." + snippet;
}
if (snippetStart + maxLength < content.length) {
const lastSpace = snippet.search(/\s[^\s]*$/);
if (lastSpace > snippet.length - 20 && lastSpace > 0) {
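The `extractContentSnippet` changes all follow one idea: find the match position on the full text, but pay for whitespace cleanup only on the small region that becomes the snippet. A condensed sketch of that flow (`stripTags`, the `+100` slack, and the length arithmetic are simplified stand-ins for the real logic):

```typescript
// Fast tag stripping, as in the hunk above (the real code applies this
// only when note.type === "text").
const stripTags = (html: string): string => html.replace(/<[^>]*>/g, "");

function snippetAround(rawContent: string, token: string, maxLength = 40): string {
    const content = stripTags(rawContent).trim();
    // Locate the match on the whole text with a single lowercase pass;
    // no whole-document whitespace normalization is needed for this.
    const idx = content.toLowerCase().indexOf(token.toLowerCase());
    const start = idx === -1 ? 0 : Math.max(0, idx - maxLength / 2);
    // Slice a slightly oversized region, then clean only that slice.
    return content
        .substring(start, start + maxLength + 100)
        .replace(/[ \t]+/g, " ")
        .trim()
        .substring(0, maxLength);
}
```

Because the match index is computed against the same uncleaned text that gets sliced, the offsets stay consistent without normalizing the whole note body.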
@@ -582,7 +576,7 @@ function extractAttributeSnippet(noteId: string, searchTokens: string[], maxLeng
// Check if any search token matches the attribute name or value
const hasMatch = searchTokens.some(token => {
-const normalizedToken = normalizeString(token.toLowerCase());
+const normalizedToken = normalize(token);
return attrName.includes(normalizedToken) || attrValue.includes(normalizedToken);
});
@@ -650,7 +644,8 @@ function searchNotesForAutocomplete(query: string, fastSearch: boolean = true) {
includeHiddenNotes: true,
fuzzyAttributeSearch: true,
ignoreInternalAttributes: true,
-ancestorNoteId: hoistedNoteService.isHoistedInHiddenSubtree() ? "root" : hoistedNoteService.getHoistedNoteId()
+ancestorNoteId: hoistedNoteService.isHoistedInHiddenSubtree() ? "root" : hoistedNoteService.getHoistedNoteId(),
+autocomplete: true
});
const allSearchResults = findResultsWithQuery(query, searchContext);
@@ -727,37 +722,40 @@ function highlightSearchResults(searchResults: SearchResult[], highlightedTokens
}
for (const result of searchResults) {
// Reset token
const tokenRegex = new RegExp(escapeRegExp(token), "gi");
let match;
// Highlight in note path title
if (result.highlightedNotePathTitle) {
const titleRegex = new RegExp(escapeRegExp(token), "gi");
-while ((match = titleRegex.exec(normalizeString(result.highlightedNotePathTitle))) !== null) {
+// Compute diacritic-free version ONCE before the loop, not on every iteration
+let titleNoDiacritics = removeDiacritic(result.highlightedNotePathTitle);
+while ((match = titleRegex.exec(titleNoDiacritics)) !== null) {
result.highlightedNotePathTitle = wrapText(result.highlightedNotePathTitle, match.index, token.length, "{", "}");
-// 2 characters are added, so we need to adjust the index
+// 2 characters are added, so we need to adjust the index and re-derive
titleRegex.lastIndex += 2;
+titleNoDiacritics = removeDiacritic(result.highlightedNotePathTitle);
}
}
// Highlight in content snippet
if (result.highlightedContentSnippet) {
const contentRegex = new RegExp(escapeRegExp(token), "gi");
-while ((match = contentRegex.exec(normalizeString(result.highlightedContentSnippet))) !== null) {
+let contentNoDiacritics = removeDiacritic(result.highlightedContentSnippet);
+while ((match = contentRegex.exec(contentNoDiacritics)) !== null) {
result.highlightedContentSnippet = wrapText(result.highlightedContentSnippet, match.index, token.length, "{", "}");
// 2 characters are added, so we need to adjust the index
contentRegex.lastIndex += 2;
+contentNoDiacritics = removeDiacritic(result.highlightedContentSnippet);
}
}
// Highlight in attribute snippet
if (result.highlightedAttributeSnippet) {
const attributeRegex = new RegExp(escapeRegExp(token), "gi");
-while ((match = attributeRegex.exec(normalizeString(result.highlightedAttributeSnippet))) !== null) {
+let attrNoDiacritics = removeDiacritic(result.highlightedAttributeSnippet);
+while ((match = attributeRegex.exec(attrNoDiacritics)) !== null) {
result.highlightedAttributeSnippet = wrapText(result.highlightedAttributeSnippet, match.index, token.length, "{", "}");
// 2 characters are added, so we need to adjust the index
attributeRegex.lastIndex += 2;
+attrNoDiacritics = removeDiacritic(result.highlightedAttributeSnippet);
}
}
}
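The hoisting pattern repeated in all three highlight loops can be exercised standalone. The helpers below are simplified stand-ins for the real `removeDiacritic`, `escapeRegExp`, and `wrapText` utilities; the key points are that the diacritic-free shadow string is computed once before the loop and re-derived only after a wrap actually changes the string, and that `lastIndex` is bumped by 2 to account for the inserted braces:

```typescript
// Simplified stand-ins for the real utilities.
const removeDiacritic = (s: string): string =>
    s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
const escapeRegExp = (s: string): string =>
    s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const wrapText = (text: string, index: number, length: number, open: string, close: string): string =>
    text.substring(0, index) + open + text.substring(index, index + length) + close + text.substring(index + length);

// Wrap every diacritic-insensitive occurrence of `token` in {braces}.
function highlight(text: string, token: string): string {
    const regex = new RegExp(escapeRegExp(token), "gi");
    // Computed once up front, not on every loop iteration.
    let shadow = removeDiacritic(text);
    let match: RegExpExecArray | null;
    while ((match = regex.exec(shadow)) !== null) {
        text = wrapText(text, match.index, token.length, "{", "}");
        regex.lastIndex += 2; // "{" and "}" added two characters
        shadow = removeDiacritic(text); // re-derive only after a change
    }
    return text;
}
```

Matching against the shadow string keeps the indices aligned with the original (composed-form) text, since stripping combining marks from NFC input does not change the string length.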


@@ -0,0 +1,675 @@
/**
* Comprehensive search benchmark suite.
*
* Covers many scenarios:
* - Single-token, multi-token, phrase-like queries
* - Fuzzy matching enabled vs disabled
* - Autocomplete vs full search
* - Diacritics / unicode queries
* - No-match queries
* - Varying note counts (1K, 5K, 10K, 20K)
* - Warm cache vs cold cache
*
* All times are in-memory (monkeypatched getContent, no real SQL).
*/
import { describe, it, expect, afterEach } from "vitest";
import searchService from "./search.js";
import BNote from "../../../becca/entities/bnote.js";
import BBranch from "../../../becca/entities/bbranch.js";
import SearchContext from "../search_context.js";
import becca from "../../../becca/becca.js";
import { NoteBuilder, note } from "../../../test/becca_mocking.js";
// ── helpers ──────────────────────────────────────────────────────────
function randomWord(len = 6): string {
const chars = "abcdefghijklmnopqrstuvwxyz";
let word = "";
for (let i = 0; i < len; i++) {
word += chars[Math.floor(Math.random() * chars.length)];
}
return word;
}
function generateHtmlContent(wordCount: number, includeKeywords = false, keywords?: string[]): string {
const paragraphs: string[] = [];
let wordsRemaining = wordCount;
const kws = keywords ?? [];
while (wordsRemaining > 0) {
const paraWords = Math.min(wordsRemaining, 20 + Math.floor(Math.random() * 40));
const words: string[] = [];
for (let i = 0; i < paraWords; i++) {
words.push(randomWord(3 + Math.floor(Math.random() * 10)));
}
if (includeKeywords && paragraphs.length === 2) {
for (let k = 0; k < kws.length; k++) {
const pos = Math.min(words.length - 1, Math.floor((words.length / (kws.length + 1)) * (k + 1)));
words[pos] = kws[k];
}
}
paragraphs.push(`<p>${words.join(" ")}</p>`);
wordsRemaining -= paraWords;
}
return `<html><body>${paragraphs.join("\n")}</body></html>`;
}
function timed<T>(fn: () => T): [T, number] {
const start = performance.now();
const result = fn();
return [result, performance.now() - start];
}
function avg(nums: number[]): number {
return nums.reduce((a, b) => a + b, 0) / nums.length;
}
function min(nums: number[]): number {
return Math.min(...nums);
}
// ── dataset builder ──────────────────────────────────────────────────
const syntheticContent: Record<string, string> = {};
function buildDataset(noteCount: number, opts: {
matchFraction?: number;
labelsPerNote?: number;
depth?: number;
contentWordCount?: number;
varyContentSize?: boolean;
titleKeywords?: string[];
contentKeywords?: string[];
/** Include notes with diacritics in titles */
includeDiacritics?: boolean;
} = {}) {
const {
matchFraction = 0.1,
labelsPerNote = 3,
depth = 4,
contentWordCount = 300,
varyContentSize = true,
titleKeywords = ["target"],
contentKeywords = titleKeywords,
includeDiacritics = false,
} = opts;
becca.reset();
for (const key of Object.keys(syntheticContent)) {
delete syntheticContent[key];
}
const rootNote = new NoteBuilder(new BNote({ noteId: "root", title: "root", type: "text" }));
new BBranch({
branchId: "none_root",
noteId: "root",
parentNoteId: "none",
notePosition: 10
});
const containers: NoteBuilder[] = [];
let parent = rootNote;
for (let d = 0; d < depth; d++) {
const container = note(`Container_${d}_${randomWord(4)}`);
parent.child(container);
containers.push(container);
parent = container;
}
const matchCount = Math.floor(noteCount * matchFraction);
const diacriticTitles = [
"résumé", "naïve", "café", "über", "ñoño", "exposé",
"Ångström", "Üntersuchung", "São Paulo", "François"
];
for (let i = 0; i < noteCount; i++) {
const isMatch = i < matchCount;
let title: string;
if (includeDiacritics && i % 20 === 0) {
// Every 20th note gets a diacritics-heavy title
const dTitle = diacriticTitles[i % diacriticTitles.length];
title = isMatch
? `${dTitle} ${titleKeywords.join(" ")} Document ${i}`
: `${dTitle} ${randomWord(5)} Note ${i}`;
} else {
title = isMatch
? `${randomWord(5)} ${titleKeywords.join(" ")} ${randomWord(5)} Document ${i}`
: `${randomWord(5)} ${randomWord(6)} ${randomWord(4)} Note ${i}`;
}
const n = note(title);
for (let l = 0; l < labelsPerNote; l++) {
const labelName = isMatch && l === 0 ? "category" : `label_${randomWord(4)}`;
const labelValue = isMatch && l === 0 ? `important ${titleKeywords[0]}` : randomWord(8);
n.label(labelName, labelValue);
}
let noteWordCount = contentWordCount;
if (varyContentSize) {
const r = Math.random();
if (r < 0.2) noteWordCount = Math.floor(contentWordCount * (0.2 + Math.random() * 0.3));
else if (r < 0.7) noteWordCount = Math.floor(contentWordCount * (0.7 + Math.random() * 0.6));
else if (r < 0.9) noteWordCount = Math.floor(contentWordCount * (1.3 + Math.random() * 0.7));
else noteWordCount = Math.floor(contentWordCount * (2.0 + Math.random() * 1.0));
}
const includeContentKeyword = isMatch && contentKeywords.length > 0;
syntheticContent[n.note.noteId] = generateHtmlContent(
noteWordCount,
includeContentKeyword,
includeContentKeyword ? contentKeywords : undefined
);
const containerIndex = i % containers.length;
containers[containerIndex].child(n);
}
// Monkeypatch getContent()
for (const noteObj of Object.values(becca.notes)) {
const noteId = noteObj.noteId;
if (syntheticContent[noteId]) {
(noteObj as any).getContent = () => syntheticContent[noteId];
} else {
(noteObj as any).getContent = () => "";
}
}
return { rootNote, matchCount };
}
// ── benchmark runner ─────────────────────────────────────────────────
interface BenchmarkResult {
query: string;
mode: string;
noteCount: number;
avgMs: number;
minMs: number;
resultCount: number;
}
function runBenchmark(
query: string,
mode: "autocomplete" | "fullSearch",
fuzzyEnabled: boolean,
iterations = 5
): BenchmarkResult {
const noteCount = Object.keys(becca.notes).length;
// Warm up
if (mode === "autocomplete") {
searchService.searchNotesForAutocomplete(query, true);
} else {
const ctx = new SearchContext({ fastSearch: false });
ctx.enableFuzzyMatching = fuzzyEnabled;
searchService.findResultsWithQuery(query, ctx);
}
const times: number[] = [];
let resultCount = 0;
for (let i = 0; i < iterations; i++) {
if (mode === "autocomplete") {
// searchNotesForAutocomplete creates its own SearchContext internally,
// so to control fuzzy matching we build the context ourselves and call
// findResultsWithQuery directly.
const [results, ms] = timed(() => {
const ctx = new SearchContext({
fastSearch: true,
includeHiddenNotes: true,
fuzzyAttributeSearch: true,
ignoreInternalAttributes: true,
autocomplete: true
});
ctx.enableFuzzyMatching = fuzzyEnabled;
return searchService.findResultsWithQuery(query, ctx);
});
times.push(ms);
resultCount = results.length;
} else {
const [results, ms] = timed(() => {
const ctx = new SearchContext({ fastSearch: false });
ctx.enableFuzzyMatching = fuzzyEnabled;
return searchService.findResultsWithQuery(query, ctx);
});
times.push(ms);
resultCount = results.length;
}
}
return {
query,
mode: `${mode}${fuzzyEnabled ? "+fuzzy" : ""}`,
noteCount,
avgMs: avg(times),
minMs: min(times),
resultCount
};
}
function printTable(title: string, results: BenchmarkResult[]) {
console.log(`\n${"═".repeat(110)}`);
console.log(` ${title}`);
console.log(`${"═".repeat(110)}`);
console.log(
" " +
"Query".padEnd(35) +
"Mode".padEnd(22) +
"Notes".padStart(7) +
"Avg (ms)".padStart(12) +
"Min (ms)".padStart(12) +
"Results".padStart(10)
);
console.log(` ${"─".repeat(98)}`);
for (const r of results) {
console.log(
" " +
`"${r.query}"`.padEnd(35) +
r.mode.padEnd(22) +
String(r.noteCount).padStart(7) +
r.avgMs.toFixed(1).padStart(12) +
r.minMs.toFixed(1).padStart(12) +
String(r.resultCount).padStart(10)
);
}
console.log(`${"═".repeat(110)}\n`);
}
// ── tests ────────────────────────────────────────────────────────────
describe("Comprehensive Search Benchmark", () => {
afterEach(() => {
becca.reset();
});
describe("Single-token queries", () => {
for (const noteCount of [1000, 5000, 10000, 20000]) {
it(`single token @ ${noteCount} notes — fuzzy on vs off, autocomplete vs full`, () => {
buildDataset(noteCount, {
matchFraction: 0.15,
titleKeywords: ["meeting"],
contentKeywords: ["meeting"],
contentWordCount: 300,
});
const results: BenchmarkResult[] = [
runBenchmark("meeting", "autocomplete", false),
runBenchmark("meeting", "autocomplete", true),
runBenchmark("meeting", "fullSearch", false),
runBenchmark("meeting", "fullSearch", true),
];
printTable(`Single Token "meeting" — ${noteCount} notes`, results);
expect(results[0].resultCount).toBeGreaterThan(0);
});
}
});
describe("Multi-token queries", () => {
for (const noteCount of [1000, 5000, 10000, 20000]) {
it(`multi token @ ${noteCount} notes — fuzzy on vs off`, () => {
buildDataset(noteCount, {
matchFraction: 0.15,
titleKeywords: ["meeting", "notes", "january"],
contentKeywords: ["meeting", "notes", "january"],
contentWordCount: 400,
});
const results: BenchmarkResult[] = [
// 2-token
runBenchmark("meeting notes", "autocomplete", false),
runBenchmark("meeting notes", "autocomplete", true),
runBenchmark("meeting notes", "fullSearch", false),
runBenchmark("meeting notes", "fullSearch", true),
// 3-token
runBenchmark("meeting notes january", "autocomplete", false),
runBenchmark("meeting notes january", "autocomplete", true),
runBenchmark("meeting notes january", "fullSearch", false),
runBenchmark("meeting notes january", "fullSearch", true),
];
printTable(`Multi Token — ${noteCount} notes`, results);
expect(results[0].resultCount).toBeGreaterThan(0);
});
}
});
describe("No-match queries (worst case — full scan, zero results)", () => {
for (const noteCount of [1000, 5000, 10000, 20000]) {
it(`no-match @ ${noteCount} notes`, () => {
buildDataset(noteCount, {
matchFraction: 0.1,
titleKeywords: ["target"],
contentKeywords: ["target"],
contentWordCount: 300,
});
const results: BenchmarkResult[] = [
runBenchmark("xyznonexistent", "autocomplete", false),
runBenchmark("xyznonexistent", "autocomplete", true),
runBenchmark("xyznonexistent", "fullSearch", false),
runBenchmark("xyznonexistent", "fullSearch", true),
runBenchmark("xyzfoo xyzbar", "autocomplete", false),
runBenchmark("xyzfoo xyzbar", "autocomplete", true),
runBenchmark("xyzfoo xyzbar", "fullSearch", false),
runBenchmark("xyzfoo xyzbar", "fullSearch", true),
];
printTable(`No-Match Queries — ${noteCount} notes`, results);
// All should return 0 results
for (const r of results) {
expect(r.resultCount).toBe(0);
}
});
}
});
describe("Diacritics / Unicode queries", () => {
for (const noteCount of [1000, 5000, 10000]) {
it(`diacritics @ ${noteCount} notes`, () => {
buildDataset(noteCount, {
matchFraction: 0.15,
titleKeywords: ["résumé"],
contentKeywords: ["résumé"],
contentWordCount: 300,
includeDiacritics: true,
});
const results: BenchmarkResult[] = [
// Exact diacritics
runBenchmark("résumé", "autocomplete", false),
runBenchmark("résumé", "autocomplete", true),
// ASCII equivalent (should still match via normalize)
runBenchmark("resume", "autocomplete", false),
runBenchmark("resume", "autocomplete", true),
// Full search
runBenchmark("résumé", "fullSearch", false),
runBenchmark("resume", "fullSearch", false),
];
printTable(`Diacritics "résumé" / "resume" — ${noteCount} notes`, results);
});
}
});
describe("Partial / prefix queries (simulating typing)", () => {
for (const noteCount of [5000, 10000, 20000]) {
it(`typing progression @ ${noteCount} notes`, () => {
buildDataset(noteCount, {
matchFraction: 0.15,
titleKeywords: ["documentation"],
contentKeywords: ["documentation"],
contentWordCount: 300,
});
const results: BenchmarkResult[] = [
runBenchmark("d", "autocomplete", false),
runBenchmark("do", "autocomplete", false),
runBenchmark("doc", "autocomplete", false),
runBenchmark("docu", "autocomplete", false),
runBenchmark("docum", "autocomplete", false),
runBenchmark("document", "autocomplete", false),
runBenchmark("documentation", "autocomplete", false),
// Same with fuzzy
runBenchmark("d", "autocomplete", true),
runBenchmark("doc", "autocomplete", true),
runBenchmark("document", "autocomplete", true),
runBenchmark("documentation", "autocomplete", true),
];
printTable(`Typing Progression "documentation" — ${noteCount} notes`, results);
});
}
});
describe("Attribute-matching queries", () => {
for (const noteCount of [5000, 10000]) {
it(`attribute search @ ${noteCount} notes`, () => {
buildDataset(noteCount, {
matchFraction: 0.15,
labelsPerNote: 5,
titleKeywords: ["important"],
contentKeywords: ["important"],
contentWordCount: 200,
});
const results: BenchmarkResult[] = [
// "category" is a label name on matching notes
runBenchmark("category", "autocomplete", false),
runBenchmark("category", "autocomplete", true),
runBenchmark("category", "fullSearch", false),
runBenchmark("category", "fullSearch", true),
// "important" appears in both title and label value
runBenchmark("important", "autocomplete", false),
runBenchmark("important", "autocomplete", true),
];
printTable(`Attribute Matching — ${noteCount} notes`, results);
});
}
});
describe("Long queries (4-5 tokens)", () => {
for (const noteCount of [5000, 10000]) {
it(`long query @ ${noteCount} notes`, () => {
buildDataset(noteCount, {
matchFraction: 0.10,
titleKeywords: ["quarterly", "budget", "review", "report"],
contentKeywords: ["quarterly", "budget", "review", "report"],
contentWordCount: 500,
});
const results: BenchmarkResult[] = [
runBenchmark("quarterly", "autocomplete", false),
runBenchmark("quarterly budget", "autocomplete", false),
runBenchmark("quarterly budget review", "autocomplete", false),
runBenchmark("quarterly budget review report", "autocomplete", false),
// Same with fuzzy
runBenchmark("quarterly budget review report", "autocomplete", true),
// Full search
runBenchmark("quarterly budget review report", "fullSearch", false),
runBenchmark("quarterly budget review report", "fullSearch", true),
];
printTable(`Long Queries (4 tokens) — ${noteCount} notes`, results);
});
}
});
describe("Mixed scenario — realistic user session", () => {
it("simulates a user session with varied queries @ 10K notes", () => {
buildDataset(10000, {
matchFraction: 0.15,
titleKeywords: ["project", "planning"],
contentKeywords: ["project", "planning", "timeline", "budget"],
contentWordCount: 400,
varyContentSize: true,
includeDiacritics: true,
depth: 6,
});
const results: BenchmarkResult[] = [
// Quick autocomplete lookups (user typing in search bar)
runBenchmark("pro", "autocomplete", false),
runBenchmark("project", "autocomplete", false),
runBenchmark("project plan", "autocomplete", false),
// Full search (user hits Enter)
runBenchmark("project", "fullSearch", false),
runBenchmark("project planning", "fullSearch", false),
runBenchmark("project planning", "fullSearch", true),
// Typo / near-miss with fuzzy
runBenchmark("projct", "autocomplete", false),
runBenchmark("projct", "autocomplete", true),
runBenchmark("projct planing", "fullSearch", false),
runBenchmark("projct planing", "fullSearch", true),
// No results
runBenchmark("xyznonexistent", "autocomplete", false),
runBenchmark("xyznonexistent foo", "fullSearch", true),
// Short common substring
runBenchmark("note", "autocomplete", false),
runBenchmark("document", "autocomplete", false),
];
printTable("Realistic User Session — 10K notes", results);
});
});
describe("Cache warmth impact", () => {
it("cold vs warm flat text index @ 10K notes", () => {
buildDataset(10000, {
matchFraction: 0.15,
titleKeywords: ["target"],
contentKeywords: ["target"],
contentWordCount: 300,
});
console.log(`\n${"═".repeat(80)}`);
console.log(" Cold vs Warm Cache — 10K notes");
console.log(`${"═".repeat(80)}`);
// Cold: first search after dataset build (flat text index not yet built)
becca.flatTextIndex = null;
becca.dirtyFlatTextNoteIds.clear();
const [coldResults, coldMs] = timed(() => {
const ctx = new SearchContext({ fastSearch: true, autocomplete: true });
ctx.enableFuzzyMatching = false;
return searchService.findResultsWithQuery("target", ctx);
});
console.log(` Cold (index build + search): ${coldMs.toFixed(1)}ms (${coldResults.length} results)`);
// Warm: subsequent searches reuse the index
const warmTimes: number[] = [];
for (let i = 0; i < 5; i++) {
const [, ms] = timed(() => {
const ctx = new SearchContext({ fastSearch: true, autocomplete: true });
ctx.enableFuzzyMatching = false;
return searchService.findResultsWithQuery("target", ctx);
});
warmTimes.push(ms);
}
console.log(` Warm (reuse index, 5 runs): avg ${avg(warmTimes).toFixed(1)}ms min ${min(warmTimes).toFixed(1)}ms`);
// Incremental: dirty a few notes and search again
const noteIds = Object.keys(becca.notes).slice(0, 50);
for (const nid of noteIds) {
becca.dirtyNoteFlatText(nid);
}
const [, incrMs] = timed(() => {
const ctx = new SearchContext({ fastSearch: true, autocomplete: true });
ctx.enableFuzzyMatching = false;
return searchService.findResultsWithQuery("target", ctx);
});
console.log(` Incremental (50 dirty notes): ${incrMs.toFixed(1)}ms`);
// Full rebuild
becca.flatTextIndex = null;
const [, rebuildMs] = timed(() => {
const ctx = new SearchContext({ fastSearch: true, autocomplete: true });
ctx.enableFuzzyMatching = false;
return searchService.findResultsWithQuery("target", ctx);
});
console.log(` Full rebuild (index = null): ${rebuildMs.toFixed(1)}ms`);
console.log(`${"═".repeat(80)}\n`);
});
});
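The cold / warm / incremental / rebuild phases measured above correspond to a cached index plus a dirty set. A minimal sketch of that lifecycle (class and method names are illustrative, not the real becca API, which uses `flatTextIndex`, `dirtyFlatTextNoteIds`, and `dirtyNoteFlatText`):

```typescript
// Illustrative incremental index: cached map plus a dirty-id set.
class FlatTextIndex {
    private index: Map<string, string> | null = null;
    private dirty = new Set<string>();

    constructor(private notes: Map<string, string>) {}

    markDirty(noteId: string): void {
        this.dirty.add(noteId);
    }

    invalidate(): void {
        this.index = null; // next search pays the full rebuild cost
    }

    search(token: string): string[] {
        let index = this.index;
        if (index === null) {
            // Cold path: rebuild the whole index.
            index = new Map<string, string>();
            for (const [id, text] of this.notes) {
                index.set(id, text.toLowerCase());
            }
        } else {
            // Incremental path: refresh only the notes marked dirty.
            for (const id of this.dirty) {
                const text = this.notes.get(id);
                if (text === undefined) index.delete(id);
                else index.set(id, text.toLowerCase());
            }
        }
        this.index = index;
        this.dirty.clear();

        const needle = token.toLowerCase();
        const hits: string[] = [];
        for (const [id, text] of index) {
            if (text.includes(needle)) hits.push(id);
        }
        return hits;
    }
}
```

The benchmark's four timings map directly onto this: first search (cold rebuild), repeated searches (warm, no index work), searches after `markDirty` (incremental refresh), and searches after `invalidate` (full rebuild).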
describe("Fuzzy matching effectiveness comparison", () => {
it("exact vs fuzzy result quality @ 10K notes", () => {
buildDataset(10000, {
matchFraction: 0.10,
titleKeywords: ["performance"],
contentKeywords: ["performance", "optimization"],
contentWordCount: 300,
});
console.log(`\n${"═".repeat(90)}`);
console.log(" Fuzzy Matching Effectiveness — 10K notes");
console.log(`${"═".repeat(90)}`);
console.log(
" " +
"Query".padEnd(30) +
"Fuzzy".padEnd(8) +
"Time (ms)".padStart(12) +
"Results".padStart(10) +
" Notes"
);
console.log(` ${"─".repeat(70)}`);
const queries = [
"performance", // exact match
"performanc", // truncated
"preformance", // typo
"performence", // common misspelling
"optimization", // exact match
"optimzation", // typo
"perf optim", // abbreviated multi
];
for (const query of queries) {
for (const fuzzy of [false, true]) {
const times: number[] = [];
let resultCount = 0;
for (let i = 0; i < 3; i++) {
const [results, ms] = timed(() => {
const ctx = new SearchContext({ fastSearch: true });
ctx.enableFuzzyMatching = fuzzy;
return searchService.findResultsWithQuery(query, ctx);
});
times.push(ms);
resultCount = results.length;
}
console.log(
" " +
`"${query}"`.padEnd(30) +
(fuzzy ? "ON" : "OFF").padEnd(8) +
avg(times).toFixed(1).padStart(12) +
String(resultCount).padStart(10)
);
}
}
console.log(`${"═".repeat(90)}\n`);
});
});
describe("Scale comparison summary", () => {
it("summary table across all note counts", () => {
const summaryResults: BenchmarkResult[] = [];
for (const noteCount of [1000, 5000, 10000, 20000]) {
buildDataset(noteCount, {
matchFraction: 0.15,
titleKeywords: ["meeting", "notes"],
contentKeywords: ["meeting", "notes"],
contentWordCount: 400,
varyContentSize: true,
depth: 5,
});
// Core scenarios
summaryResults.push(runBenchmark("meeting", "autocomplete", false));
summaryResults.push(runBenchmark("meeting", "autocomplete", true));
summaryResults.push(runBenchmark("meeting notes", "autocomplete", false));
summaryResults.push(runBenchmark("meeting notes", "autocomplete", true));
summaryResults.push(runBenchmark("meeting", "fullSearch", false));
summaryResults.push(runBenchmark("meeting", "fullSearch", true));
summaryResults.push(runBenchmark("meeting notes", "fullSearch", false));
summaryResults.push(runBenchmark("meeting notes", "fullSearch", true));
summaryResults.push(runBenchmark("xyznonexistent", "autocomplete", false));
summaryResults.push(runBenchmark("xyznonexistent", "fullSearch", true));
}
printTable("Scale Comparison Summary", summaryResults);
});
});
});


@@ -0,0 +1,665 @@
/**
* Search performance profiling tests.
*
* These tests measure where time is spent in the search pipeline.
* We monkeypatch note.getContent() to return synthetic HTML content
* since unit tests don't have a real SQLite database.
*
* KNOWN GAPS vs production:
* - note.getContent() is instant (monkeypatched) vs ~2ms SQL fetch
* - NoteContentFulltextExp.execute() is skipped (no sql.iterateRows)
* because fastSearch=true uses only NoteFlatTextExp
* - These tests focus on the in-memory/CPU-bound parts of the pipeline
*/
import { describe, it, expect, beforeEach, afterEach } from "vitest";
import searchService from "./search.js";
import BNote from "../../../becca/entities/bnote.js";
import BBranch from "../../../becca/entities/bbranch.js";
import SearchContext from "../search_context.js";
import becca from "../../../becca/becca.js";
import beccaService from "../../../becca/becca_service.js";
import { NoteBuilder, note, id } from "../../../test/becca_mocking.js";
import SearchResult from "../search_result.js";
import { normalizeSearchText } from "../utils/text_utils.js";
// ── helpers ──────────────────────────────────────────────────────────
function randomWord(len = 6): string {
const chars = "abcdefghijklmnopqrstuvwxyz";
let word = "";
for (let i = 0; i < len; i++) {
word += chars[Math.floor(Math.random() * chars.length)];
}
return word;
}
function generateHtmlContent(wordCount: number, includeKeywords = false, keywords?: string[]): string {
const paragraphs: string[] = [];
let wordsRemaining = wordCount;
const kws = keywords ?? ["target"];
let keywordsInjected = false;
while (wordsRemaining > 0) {
const paraWords = Math.min(wordsRemaining, 20 + Math.floor(Math.random() * 40));
const words: string[] = [];
for (let i = 0; i < paraWords; i++) {
words.push(randomWord(3 + Math.floor(Math.random() * 10)));
}
// Inject all keywords at spaced positions into the third paragraph,
// or into the final paragraph for notes too short to have three.
if (includeKeywords && !keywordsInjected && (paragraphs.length === 2 || paraWords === wordsRemaining)) {
for (let k = 0; k < kws.length; k++) {
const pos = Math.min(words.length - 1, Math.floor((words.length / (kws.length + 1)) * (k + 1)));
words[pos] = kws[k];
}
keywordsInjected = true;
}
paragraphs.push(`<p>${words.join(" ")}</p>`);
wordsRemaining -= paraWords;
}
return `<html><body>${paragraphs.join("\n")}</body></html>`;
}
function timed<T>(fn: () => T): [T, number] {
const start = performance.now();
const result = fn();
return [result, performance.now() - start];
}
interface TimingEntry { label: string; ms: number; }
function reportTimings(title: string, timings: TimingEntry[]) {
const total = timings.reduce((s, t) => s + t.ms, 0);
console.log(`\n=== ${title} (total: ${total.toFixed(1)}ms) ===`);
for (const { label, ms } of timings) {
const pct = total > 0 ? ((ms / total) * 100).toFixed(0) : "0";
const bar = total > 0 ? "#".repeat(Math.max(1, Math.round((ms / total) * 40))) : "";
console.log(` ${label.padEnd(55)} ${ms.toFixed(1).padStart(8)}ms ${pct.padStart(3)}% ${bar}`);
}
}
// ── dataset builder ──────────────────────────────────────────────────
const syntheticContent: Record<string, string> = {};
function buildDataset(noteCount: number, opts: {
matchFraction?: number;
labelsPerNote?: number;
depth?: number;
contentWordCount?: number;
/** When set, contentWordCount is treated as a median and actual sizes vary from 0.2x to 3x */
varyContentSize?: boolean;
/** Keywords to inject into matching notes' titles (default: ["target"]) */
titleKeywords?: string[];
/** Keywords to inject into matching notes' content (default: same as titleKeywords) */
contentKeywords?: string[];
} = {}) {
const {
matchFraction = 0.1,
labelsPerNote = 3,
depth = 3,
contentWordCount = 200,
varyContentSize = false,
titleKeywords = ["target"],
contentKeywords = titleKeywords,
} = opts;
becca.reset();
for (const key of Object.keys(syntheticContent)) {
delete syntheticContent[key];
}
const rootNote = new NoteBuilder(new BNote({ noteId: "root", title: "root", type: "text" }));
new BBranch({
branchId: "none_root",
noteId: "root",
parentNoteId: "none",
notePosition: 10
});
const containers: NoteBuilder[] = [];
let parent = rootNote;
for (let d = 0; d < depth; d++) {
const container = note(`Container_${d}_${randomWord(4)}`);
parent.child(container);
containers.push(container);
parent = container;
}
const matchCount = Math.floor(noteCount * matchFraction);
for (let i = 0; i < noteCount; i++) {
const isMatch = i < matchCount;
const title = isMatch
? `${randomWord(5)} ${titleKeywords.join(" ")} ${randomWord(5)} Document ${i}`
: `${randomWord(5)} ${randomWord(6)} ${randomWord(4)} Note ${i}`;
const n = note(title);
for (let l = 0; l < labelsPerNote; l++) {
const labelName = isMatch && l === 0 ? "category" : `label_${randomWord(4)}`;
const labelValue = isMatch && l === 0 ? `important ${titleKeywords[0]}` : randomWord(8);
n.label(labelName, labelValue);
}
// Vary content size: 0.2x to 3x the median, producing a realistic
// mix of short stubs, medium notes, and long documents.
let noteWordCount = contentWordCount;
if (varyContentSize) {
const r = Math.random();
if (r < 0.2) {
noteWordCount = Math.floor(contentWordCount * (0.2 + Math.random() * 0.3)); // 20-50% (short stubs)
} else if (r < 0.7) {
noteWordCount = Math.floor(contentWordCount * (0.7 + Math.random() * 0.6)); // 70-130% (medium)
} else if (r < 0.9) {
noteWordCount = Math.floor(contentWordCount * (1.3 + Math.random() * 0.7)); // 130-200% (long)
} else {
noteWordCount = Math.floor(contentWordCount * (2.0 + Math.random() * 1.0)); // 200-300% (very long)
}
}
const includeContentKeyword = isMatch && contentKeywords.length > 0;
syntheticContent[n.note.noteId] = generateHtmlContent(
noteWordCount,
includeContentKeyword,
includeContentKeyword ? contentKeywords : undefined
);
const containerIndex = i % containers.length;
containers[containerIndex].child(n);
}
// Monkeypatch getContent()
for (const noteObj of Object.values(becca.notes)) {
const noteId = noteObj.noteId;
if (syntheticContent[noteId]) {
(noteObj as any).getContent = () => syntheticContent[noteId];
} else {
(noteObj as any).getContent = () => "";
}
}
return { rootNote, matchCount };
}
// ── profiling tests ──────────────────────────────────────────────────
describe("Search Profiling", () => {
afterEach(() => {
becca.reset();
});
/**
* Break down the autocomplete pipeline into every individual stage,
* including previously unmeasured operations like getBestNotePath,
* SearchResult construction, and getNoteTitleForPath.
*/
describe("Granular autocomplete pipeline", () => {
for (const noteCount of [500, 2000, 5000, 10000]) {
it(`granular breakdown with ${noteCount} notes`, () => {
const timings: TimingEntry[] = [];
const [, buildMs] = timed(() => buildDataset(noteCount, {
matchFraction: 0.2,
contentWordCount: 300,
depth: 5
}));
timings.push({ label: `Dataset build (${noteCount} notes)`, ms: buildMs });
// === NoteFlatTextExp: getCandidateNotes ===
// This calls getFlatText() + normalizeSearchText() for EVERY note
const allNotes = Object.values(becca.notes);
for (const n of allNotes) n.invalidateThisCache();
const [, candidateMs] = timed(() => {
const token = normalizeSearchText("target");
let count = 0;
for (const n of allNotes) {
const flatText = normalizeSearchText(n.getFlatText());
if (flatText.includes(token)) count++;
}
return count;
});
timings.push({ label: `getCandidateNotes simulation (cold caches)`, ms: candidateMs });
// Warm cache version
const [candidateCount, candidateWarmMs] = timed(() => {
const token = normalizeSearchText("target");
let count = 0;
for (const n of allNotes) {
const flatText = normalizeSearchText(n.getFlatText());
if (flatText.includes(token)) count++;
}
return count;
});
timings.push({ label: `getCandidateNotes simulation (warm caches)`, ms: candidateWarmMs });
// === getBestNotePath for each candidate ===
const candidates = allNotes.filter(n => {
const flatText = normalizeSearchText(n.getFlatText());
return flatText.includes("target");
});
const [, pathMs] = timed(() => {
for (const n of candidates) {
n.getBestNotePath();
}
});
timings.push({ label: `getBestNotePath (${candidates.length} notes)`, ms: pathMs });
// === SearchResult construction (includes getNoteTitleForPath) ===
const paths = candidates.map(n => n.getBestNotePath()).filter(Boolean);
const [searchResults, srMs] = timed(() => {
return paths.map(p => new SearchResult(p));
});
timings.push({ label: `SearchResult construction (${paths.length} results)`, ms: srMs });
// === computeScore ===
const [, scoreMs] = timed(() => {
for (const r of searchResults) {
r.computeScore("target", ["target"], true);
}
});
timings.push({ label: `computeScore with fuzzy (${searchResults.length} results)`, ms: scoreMs });
const [, scoreNoFuzzyMs] = timed(() => {
for (const r of searchResults) {
r.computeScore("target", ["target"], false);
}
});
timings.push({ label: `computeScore no-fuzzy`, ms: scoreNoFuzzyMs });
// === Sorting ===
const [, sortMs] = timed(() => {
searchResults.sort((a, b) => {
if (a.score !== b.score) return b.score - a.score;
if (a.notePathArray.length === b.notePathArray.length) {
return a.notePathTitle < b.notePathTitle ? -1 : 1;
}
return a.notePathArray.length - b.notePathArray.length;
});
});
timings.push({ label: `Sort results`, ms: sortMs });
// === Trim + content snippet extraction ===
const trimmed = searchResults.slice(0, 200);
const [, snippetMs] = timed(() => {
for (const r of trimmed) {
r.contentSnippet = searchService.extractContentSnippet(
r.noteId, ["target"]
);
}
});
timings.push({ label: `Content snippet extraction (${trimmed.length} results)`, ms: snippetMs });
const [, attrMs] = timed(() => {
for (const r of trimmed) {
r.attributeSnippet = searchService.extractAttributeSnippet(
r.noteId, ["target"]
);
}
});
timings.push({ label: `Attribute snippet extraction`, ms: attrMs });
// === Highlighting ===
const [, hlMs] = timed(() => {
searchService.highlightSearchResults(trimmed, ["target"]);
});
timings.push({ label: `Highlighting`, ms: hlMs });
// === Final mapping (getNoteTitleAndIcon) ===
const [, mapMs] = timed(() => {
for (const r of trimmed) {
beccaService.getNoteTitleAndIcon(r.noteId);
}
});
timings.push({ label: `getNoteTitleAndIcon (${trimmed.length} results)`, ms: mapMs });
// === Full autocomplete for comparison ===
const [autoResults, autoMs] = timed(() => {
return searchService.searchNotesForAutocomplete("target", true);
});
timings.push({ label: `Full autocomplete call (end-to-end)`, ms: autoMs });
reportTimings(`Granular Autocomplete — ${noteCount} notes`, timings);
expect(autoResults.length).toBeGreaterThan(0);
});
}
});
/**
* Test the specific cost of normalizeSearchText which is called
* pervasively throughout the pipeline.
*/
describe("normalizeSearchText cost", () => {
it("profile normalizeSearchText at scale", () => {
buildDataset(5000, { matchFraction: 0.2, contentWordCount: 100 });
// Generate various text lengths to profile
const shortTexts = Array.from({ length: 5000 }, () => randomWord(10));
const mediumTexts = Array.from({ length: 5000 }, () =>
Array.from({ length: 20 }, () => randomWord(6)).join(" ")
);
const longTexts = Object.values(becca.notes).map(n => n.getFlatText());
console.log("\n=== normalizeSearchText cost ===");
const [, shortMs] = timed(() => {
for (const t of shortTexts) normalizeSearchText(t);
});
console.log(` 5000 short texts (10 chars): ${shortMs.toFixed(1)}ms (${(shortMs/5000*1000).toFixed(1)}µs/call)`);
const [, medMs] = timed(() => {
for (const t of mediumTexts) normalizeSearchText(t);
});
console.log(` 5000 medium texts (120 chars): ${medMs.toFixed(1)}ms (${(medMs/5000*1000).toFixed(1)}µs/call)`);
const [, longMs] = timed(() => {
for (const t of longTexts) normalizeSearchText(t);
});
console.log(` ${longTexts.length} flat texts (varying): ${longMs.toFixed(1)}ms (${(longMs/longTexts.length*1000).toFixed(1)}µs/call)`);
});
});
/**
* Test the searchPathTowardsRoot recursive walk which runs
* for every candidate note in NoteFlatTextExp.
*/
describe("searchPathTowardsRoot cost", () => {
it("profile recursive walk with varying hierarchy depth", () => {
console.log("\n=== Search path walk vs hierarchy depth ===");
for (const depth of [3, 5, 8, 12]) {
buildDataset(2000, {
matchFraction: 0.15,
depth,
contentWordCount: 50
});
const [results, ms] = timed(() => {
const ctx = new SearchContext({ fastSearch: true });
return searchService.findResultsWithQuery("target", ctx);
});
console.log(` depth=${depth}: ${ms.toFixed(1)}ms (${results.length} results)`);
}
});
});
/**
* Content snippet extraction scaling — the operation that calls
* note.getContent() for each result.
*/
describe("Content snippet extraction", () => {
it("profile snippet extraction with varying content sizes", () => {
console.log("\n=== Content snippet extraction vs content size ===");
for (const wordCount of [50, 200, 500, 1000, 2000, 5000]) {
buildDataset(500, {
matchFraction: 0.5,
contentWordCount: wordCount
});
const ctx = new SearchContext({ fastSearch: true });
const results = searchService.findResultsWithQuery("target", ctx);
const trimmed = results.slice(0, 200);
const [, ms] = timed(() => {
for (const r of trimmed) {
r.contentSnippet = searchService.extractContentSnippet(
r.noteId, ["target"]
);
}
});
const avgContentLen = Object.values(syntheticContent)
.slice(0, 100)
.reduce((s, c) => s + c.length, 0) / 100;
console.log(` ${String(wordCount).padStart(5)} words/note (avg ${Math.round(avgContentLen)} chars) × ${trimmed.length} results: ${ms.toFixed(1)}ms (${(ms / trimmed.length).toFixed(3)}ms/note)`);
}
});
it("profile snippet extraction with varying result counts", () => {
console.log("\n=== Content snippet extraction vs result count ===");
buildDataset(2000, {
matchFraction: 0.5,
contentWordCount: 500
});
const ctx = new SearchContext({ fastSearch: true });
const allResults = searchService.findResultsWithQuery("target", ctx);
for (const count of [5, 10, 20, 50, 100, 200]) {
const subset = allResults.slice(0, count);
const [, ms] = timed(() => {
for (const r of subset) {
r.contentSnippet = searchService.extractContentSnippet(
r.noteId, ["target"]
);
}
});
console.log(` ${String(count).padStart(3)} results: ${ms.toFixed(1)}ms (${(ms / count).toFixed(3)}ms/note)`);
}
});
});
/**
* Two-phase exact/fuzzy search cost.
*/
describe("Two-phase search cost", () => {
for (const noteCount of [1000, 5000, 10000]) {
it(`exact vs progressive with ${noteCount} notes`, () => {
const timings: TimingEntry[] = [];
buildDataset(noteCount, { matchFraction: 0.005, contentWordCount: 50 });
const [exactR, exactMs] = timed(() => {
const ctx = new SearchContext({ fastSearch: true });
ctx.enableFuzzyMatching = false;
return searchService.findResultsWithQuery("target", ctx);
});
timings.push({ label: `Exact-only (${exactR.length} results)`, ms: exactMs });
const [progR, progMs] = timed(() => {
const ctx = new SearchContext({ fastSearch: true });
return searchService.findResultsWithQuery("target", ctx);
});
timings.push({ label: `Progressive exact→fuzzy (${progR.length} results)`, ms: progMs });
const overhead = progMs - exactMs;
timings.push({ label: `Fuzzy phase overhead`, ms: Math.max(0, overhead) });
reportTimings(`Two-phase — ${noteCount} notes`, timings);
});
}
});
/**
* End-to-end scaling to give the full picture.
*/
/**
* Multi-token search with varying content sizes.
* Real users search things like "meeting notes january" — this exercises
* the multi-token path (which doesn't use the single-token fast path)
* with a realistic mix of note sizes.
*/
describe("Multi-token search with varying content sizes", () => {
it("single vs multi-token autocomplete at scale", () => {
console.log("\n=== Single vs multi-token autocomplete (varying content sizes) ===");
for (const noteCount of [1000, 5000, 10000, 20000]) {
buildDataset(noteCount, {
matchFraction: 0.15,
contentWordCount: 400,
varyContentSize: true,
depth: 5,
titleKeywords: ["meeting", "notes", "january"],
contentKeywords: ["meeting", "notes", "january"],
});
// Warm up
searchService.searchNotesForAutocomplete("meeting", true);
// Single token
const singleTimes: number[] = [];
for (let i = 0; i < 3; i++) {
const [, ms] = timed(() => searchService.searchNotesForAutocomplete("meeting", true));
singleTimes.push(ms);
}
const singleAvg = singleTimes.reduce((a, b) => a + b, 0) / singleTimes.length;
// Two tokens
const twoTimes: number[] = [];
for (let i = 0; i < 3; i++) {
const [, ms] = timed(() => searchService.searchNotesForAutocomplete("meeting notes", true));
twoTimes.push(ms);
}
const twoAvg = twoTimes.reduce((a, b) => a + b, 0) / twoTimes.length;
// Three tokens
const threeTimes: number[] = [];
for (let i = 0; i < 3; i++) {
const [, ms] = timed(() => searchService.searchNotesForAutocomplete("meeting notes january", true));
threeTimes.push(ms);
}
const threeAvg = threeTimes.reduce((a, b) => a + b, 0) / threeTimes.length;
console.log(
` ${String(noteCount).padStart(6)} notes: ` +
`1-token ${singleAvg.toFixed(1)}ms ` +
`2-token ${twoAvg.toFixed(1)}ms ` +
`3-token ${threeAvg.toFixed(1)}ms`
);
}
});
it("multi-token with realistic content size distribution", () => {
console.log("\n=== Multi-token search — content size distribution ===");
buildDataset(5000, {
matchFraction: 0.15,
contentWordCount: 400,
varyContentSize: true,
depth: 5,
titleKeywords: ["project", "review"],
contentKeywords: ["project", "review"],
});
// Report the actual content size distribution
const sizes = Object.values(syntheticContent).map(c => c.length);
sizes.sort((a, b) => a - b);
const p10 = sizes[Math.floor(sizes.length * 0.1)];
const p50 = sizes[Math.floor(sizes.length * 0.5)];
const p90 = sizes[Math.floor(sizes.length * 0.9)];
const p99 = sizes[Math.floor(sizes.length * 0.99)];
console.log(` Content sizes: p10=${p10} p50=${p50} p90=${p90} p99=${p99} chars`);
// Warm up
searchService.searchNotesForAutocomplete("project", true);
const queries = [
"project",
"project review",
"project review document",
`${randomWord(7)}`, // no-match single token
`${randomWord(5)} ${randomWord(6)}`, // no-match multi token
];
for (const query of queries) {
const times: number[] = [];
let resultCount = 0;
for (let i = 0; i < 3; i++) {
const [r, ms] = timed(() => searchService.searchNotesForAutocomplete(query, true));
times.push(ms);
resultCount = r.length;
}
const avg = times.reduce((a, b) => a + b, 0) / times.length;
const label = `"${query}"`.padEnd(35);
console.log(` ${label} ${avg.toFixed(1)}ms (${resultCount} results)`);
}
});
});
describe("End-to-end scaling", () => {
it("autocomplete at different scales", () => {
console.log("\n=== End-to-end autocomplete scaling ===");
console.log(" (fastSearch=true, monkeypatched getContent, no real SQL)");
for (const noteCount of [100, 500, 1000, 2000, 5000, 10000, 20000]) {
buildDataset(noteCount, {
matchFraction: 0.2,
contentWordCount: 300,
depth: 4
});
// Warm up
searchService.searchNotesForAutocomplete("target", true);
const times: number[] = [];
for (let i = 0; i < 3; i++) {
const [, ms] = timed(() => searchService.searchNotesForAutocomplete("target", true));
times.push(ms);
}
const avg = times.reduce((a, b) => a + b, 0) / times.length;
const min = Math.min(...times);
console.log(
` ${String(noteCount).padStart(6)} notes: avg ${avg.toFixed(1)}ms ` +
`min ${min.toFixed(1)}ms`
);
}
});
it("compare fast vs non-fast search", () => {
console.log("\n=== Fast vs non-fast search (no real SQL for content) ===");
for (const noteCount of [500, 2000, 5000]) {
buildDataset(noteCount, {
matchFraction: 0.2,
contentWordCount: 200,
depth: 4
});
const [, fastMs] = timed(() => {
const ctx = new SearchContext({ fastSearch: true });
return searchService.findResultsWithQuery("target", ctx);
});
// Non-fast search tries NoteContentFulltextExp which uses sql.iterateRows
// This will likely fail/return empty since there's no real DB, but we
// can still measure the overhead of attempting it
let nonFastMs: number;
let nonFastCount: number;
try {
const [results, ms] = timed(() => {
const ctx = new SearchContext({ fastSearch: false });
return searchService.findResultsWithQuery("target", ctx);
});
nonFastMs = ms;
nonFastCount = results.length;
} catch {
nonFastMs = -1;
nonFastCount = -1;
}
console.log(
` ${String(noteCount).padStart(5)} notes: fast=${fastMs.toFixed(1)}ms ` +
`non-fast=${nonFastMs >= 0 ? nonFastMs.toFixed(1) + 'ms' : 'FAILED (no real DB)'} ` +
`(non-fast results: ${nonFastCount})`
);
}
});
});
});

@@ -21,4 +21,6 @@ export interface SearchParams {
limit?: number | null;
debug?: boolean;
fuzzyAttributeSearch?: boolean;
/** When true, skip the two-phase fuzzy fallback and use the single-token fast path. */
autocomplete?: boolean;
}
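
The commit history above describes this flag as skipping the two-phase fuzzy fallback while the user is still typing. A minimal self-contained sketch of that gating logic; every name except `autocomplete` is hypothetical:

```typescript
// Hypothetical sketch: how an `autocomplete` flag can gate the second
// (fuzzy) phase of a two-phase search. Only the flag name comes from the
// actual SearchParams interface; the rest is illustrative.
interface PhaseParams {
    autocomplete?: boolean;
    enableFuzzyMatching?: boolean;
}

function plannedPhases(params: PhaseParams): string[] {
    const phases = ["exact"];
    // While the user is still typing, a fuzzy retry only adds latency,
    // so autocomplete requests stop after the exact phase.
    if (!params.autocomplete && params.enableFuzzyMatching !== false) {
        phases.push("fuzzy");
    }
    return phases;
}

console.log(plannedPhases({ autocomplete: true })); // exact phase only
console.log(plannedPhases({}));                     // exact, then fuzzy fallback
```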

@@ -275,19 +275,17 @@ export function fuzzyMatchWordWithResult(token: string, text: string, maxDistanc
}
try {
// Normalize both strings for comparison
// Normalize for comparison — some callers pass pre-normalized text,
// others don't, so this function must be self-contained.
const normalizedToken = token.toLowerCase();
const normalizedText = text.toLowerCase();
// Exact match check first (most common case)
if (normalizedText.includes(normalizedToken)) {
// Find the exact match in the original text to preserve case
const exactMatch = text.match(new RegExp(escapeRegExp(token), 'i'));
return exactMatch ? exactMatch[0] : token;
return token;
}
// For fuzzy matching, we need to check individual words in the text
// Split the text into words and check each word against the token
// For fuzzy matching, split into words and check each against the token
const words = normalizedText.split(/\s+/).filter(word => word.length > 0);
const originalWords = text.split(/\s+/).filter(word => word.length > 0);
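
The hunk above checks for an exact substring before doing any fuzzy work. The same exact-first, per-word edit distance strategy can be sketched self-contained, outside the codebase; `editDistance`, `fuzzyMatchSketch`, and the `maxDistance = 2` default are illustrative assumptions, not the project's implementation:

```typescript
// Standard Levenshtein distance via dynamic programming.
function editDistance(a: string, b: string): number {
    const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array(b.length).fill(0)]);
    for (let j = 0; j <= b.length; j++) dp[0][j] = j;
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            dp[i][j] = Math.min(
                dp[i - 1][j] + 1,                                   // deletion
                dp[i][j - 1] + 1,                                   // insertion
                dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
            );
        }
    }
    return dp[a.length][b.length];
}

// Exact match first (the cheap, common case), then per-word fuzzy fallback.
function fuzzyMatchSketch(token: string, text: string, maxDistance = 2): string | null {
    const normalizedToken = token.toLowerCase();
    const normalizedText = text.toLowerCase();
    if (normalizedText.includes(normalizedToken)) return token;
    for (const word of normalizedText.split(/\s+/).filter(w => w.length > 0)) {
        if (editDistance(normalizedToken, word) <= maxDistance) return word;
    }
    return null;
}
```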

docs/search-performance-benchmarks.md (new file)

@@ -0,0 +1,111 @@
# Search Performance Benchmarks
Comparison of `main` vs `feat/search-perf-take1` branch.
> **Methodology:** In-memory benchmarks using synthetic datasets with monkeypatched `getContent()`. Both branches tested on the same machine in the same session. Times are avg of 5 iterations with warm caches. Note content I/O (`NoteContentFulltextExp` blob scan) is not measured — these numbers reflect the in-memory pipeline only.
>
> **Benchmark source:** `apps/server/src/services/search/services/search_benchmark.spec.ts`
---
## End-to-End Results at 10K Notes
### Autocomplete (typing in the search bar, `fastSearch=true`)
| Query | main | this PR | Change |
|:------|-----:|--------:|-------:|
| `"meeting"` | 24.7ms | 14.3ms | **-42%** |
| `"meeting notes"` | 33.0ms | 15.6ms | **-53%** |
| `"meeting notes january"` | 43.2ms | 17.7ms | **-59%** |
| `"documentation"` | 17.5ms | 11.0ms | **-37%** |
| `"note"` (matches 85% of notes) | 90.8ms | 46.4ms | **-49%** |
| `"projct"` (typo, fuzzy ON) | 100.7ms | 6.0ms | **-94%** |
| `"xyznonexistent"` (no match, fuzzy ON) | 18.2ms | 6.0ms | **-67%** |
| `"xyzfoo xyzbar"` (no match, fuzzy ON) | 63.4ms | 7.1ms | **-89%** |
### Full Search (pressing Enter, `fastSearch=false`)
| Query | main | this PR | Change |
|:------|-----:|--------:|-------:|
| `"meeting"` | 22.9ms | 19.6ms | **-14%** |
| `"meeting notes"` | 35.7ms | 17.4ms | **-51%** |
| `"meeting notes january"` | 43.4ms | 21.0ms | **-52%** |
| `"quarterly budget review report"` | 37.1ms | 18.3ms | **-51%** |
| `"project planning"` | 27.4ms | 17.3ms | **-37%** |
### Full Search with Fuzzy Matching
| Query | main | this PR | Change |
|:------|-----:|--------:|-------:|
| `"meeting"` | 23.3ms | 17.8ms | **-24%** |
| `"meeting notes"` | 33.8ms | 18.6ms | **-45%** |
| `"meeting notes january"` | 43.2ms | 18.0ms | **-58%** |
| `"quarterly budget review report"` | 39.5ms | 17.2ms | **-56%** |
| `"project planning"` | 32.8ms | 18.6ms | **-43%** |
| `"projct planing"` (typo, recovers 1,500 results) | 133.8ms | 94.8ms | **-29%** |
| `"xyzfoo xyzbar"` (no match, worst case) | 64.2ms | 61.4ms | -4% |
---
## Scaling Behavior
### Autocomplete: `"meeting notes"` (fuzzy OFF)
| Notes | main | this PR | Change |
|------:|-----:|--------:|-------:|
| 1,000 | 2.7ms | 1.1ms | **-59%** |
| 5,000 | 15.8ms | 5.9ms | **-63%** |
| 10,000 | 33.0ms | 15.6ms | **-53%** |
| 20,000 | 67.3ms | 33.6ms | **-50%** |
### Full search: `"meeting notes january"` (fuzzy ON)
| Notes | main | this PR | Change |
|------:|-----:|--------:|-------:|
| 1,000 | 3.7ms | 1.3ms | **-65%** |
| 5,000 | 21.2ms | 8.7ms | **-59%** |
| 10,000 | 43.2ms | 18.0ms | **-58%** |
| 20,000 | 92.8ms | 40.1ms | **-57%** |
### Autocomplete no-match: `"xyzfoo xyzbar"` (fuzzy ON)
| Notes | main | this PR | Change |
|------:|-----:|--------:|-------:|
| 1,000 | 5.1ms | 0.4ms | **-92%** |
| 5,000 | 29.0ms | 2.2ms | **-92%** |
| 10,000 | 63.4ms | 7.1ms | **-89%** |
| 20,000 | 128.8ms | 19.1ms | **-85%** |
### Typing progression at 10K notes (autocomplete, fuzzy OFF)
| Prefix typed | main | this PR | Change |
|:-------------|-----:|--------:|-------:|
| `"d"` | 66.9ms | 44.8ms | **-33%** |
| `"doc"` | 20.9ms | 14.7ms | **-30%** |
| `"document"` | 16.8ms | 11.8ms | **-30%** |
| `"documentation"` | 17.5ms | 11.0ms | **-37%** |
---
## What Changed
1. **Pre-built flat text index** with incremental dirty-marking in Becca — avoids rebuilding per-note flat text on every search
2. **Skip two-phase fuzzy fallback for autocomplete** — the user is still typing, fuzzy adds latency for no benefit
3. **Pre-normalized attribute names/values** cached on `BAttribute` at construction time
4. **Cached normalized parent titles** per search execution via `Map` in `NoteFlatTextExp`
5. **Set-based token lookup** in `searchPathTowardsRoot` (O(1) vs O(n) `Array.includes`)
6. **Removed redundant `toLowerCase()`** — `normalizeSearchText` already lowercases; callers were double-lowering
7. **Pre-normalize tokens once** in `addScoreForStrings` instead of re-normalizing per chunk
8. **Skip edit distance computation** when fuzzy matching is disabled
9. **Faster content snippet extraction** — regex `/<[^>]*>/g` instead of `striptags` library; normalize only the snippet window, not full content
10. **`removeDiacritic()` hoisted outside regex while-loop** in highlighting
11. **Single-token autocomplete fast path** — skips the recursive parent walk entirely, uses `getBestNotePath()` directly
12. **User option `searchEnableFuzzyMatching`** — lets users disable fuzzy matching for fastest possible search
---
## Known Limitations
- These benchmarks measure the **in-memory pipeline only** (titles, attributes, scoring, highlighting). The `NoteContentFulltextExp` sequential blob scan from SQLite is not exercised because `getContent()` is monkeypatched. In production, the full search path (`fastSearch=false`) includes reading every note's content from disk, which adds significant time at scale.
- Fuzzy matching on the full-search two-phase path shows slight regressions (+9-12%) for single-token queries because edit distance computation cost hasn't changed on that path. Multi-token queries still improve because the token normalization and tree walk optimizations apply to both paths.
- At 1K notes, some results show noise-level regressions. The optimizations target 5K+ note scales where overhead is measurable.

@@ -134,6 +134,10 @@ export interface OptionDefinitions extends KeyboardShortcutsOptions<KeyboardActi
backgroundEffects: boolean;
newLayout: boolean;
// Search settings
/** Whether fuzzy matching is enabled in search (matches similar words when exact matches are insufficient). */
searchEnableFuzzyMatching: boolean;
// Share settings
redirectBareDomain: boolean;
showLoginInShareTheme: boolean;