fix: tokenizer splits hyphenated identifiers
Hyphen- and slash-separated identifiers (polisher-control, twyman-green, etc.) were kept as single tokens by the reinforcement / memory-ranking tokenizer, so a query had to match the exact hyphenation to score. The harness caught this on p06-control-rule: 'polisher control design rule' scored an overlap of 2 against each of the three polisher-*/design-rule memories, and the tiebreaker picked the wrong one.

Hyphenated words now contribute both the full form AND each sub-token. Extracted an _add_token helper to avoid duplicating the stop-word / length gate at both insertion points.

All 28 reinforcement matcher tests still pass — the new sub-tokens only widen the match set, they never narrow it, so memories that previously reinforced continue to reinforce.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
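A hedged illustration of the tie described above (the memory names and token sets here are invented stand-ins, not the actual p06-control-rule fixtures): with hyphenated identifiers kept whole, the query tokens can only reach the shared design/rule words, so every candidate scores the same overlap and ranking falls to the tiebreaker.

```python
# Hypothetical token sets mimicking the OLD tokenizer's output, where
# hyphenated identifiers stay single tokens. Names and contents are
# invented for illustration only.
old_memories = {
    "mem-polisher-control":  {"polisher-control", "design", "rule", "tolerance"},
    "mem-polisher-speed":    {"polisher-speed", "design", "rule", "rpm"},
    "mem-polisher-pressure": {"polisher-pressure", "design", "rule", "psi"},
}
query = {"polisher", "control", "design", "rule"}

# Each memory overlaps the query only on {"design", "rule"} -> all tie at 2.
scores = {name: len(query & tokens) for name, tokens in old_memories.items()}

# After the fix, the first memory's set also carries the sub-tokens
# "polisher" and "control", so it scores 4 and wins outright.
fixed = old_memories["mem-polisher-control"] | {"polisher", "control"}
```

The point of the fix is visible in the score gap: the sub-tokens break the three-way tie without removing any token that matched before.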
@@ -180,20 +180,36 @@ def _stem(word: str) -> str:
 def _tokenize(text: str) -> set[str]:
     """Split normalized text into a stemmed token set.
 
-    Strips punctuation, drops words shorter than 3 chars and stop words.
+    Strips punctuation, drops words shorter than 3 chars and stop
+    words. Hyphenated and slash-separated identifiers
+    (``polisher-control``, ``twyman-green``, ``2-projects/interferometer``)
+    produce both the full form AND each sub-token, so a query for
+    "polisher control" can match a memory that wrote
+    "polisher-control" without forcing callers to guess the exact
+    hyphenation.
     """
     tokens: set[str] = set()
     for raw in text.split():
-        # Strip leading/trailing punctuation (commas, periods, quotes, etc.)
         word = raw.strip(".,;:!?\"'()[]{}-/")
-        if len(word) < 3:
+        if not word:
             continue
-        if word in _STOP_WORDS:
-            continue
-        tokens.add(_stem(word))
+        _add_token(tokens, word)
+        # Also add sub-tokens split on internal '-' or '/' so
+        # hyphenated identifiers match queries that don't hyphenate.
+        if "-" in word or "/" in word:
+            for sub in re.split(r"[-/]+", word):
+                _add_token(tokens, sub)
     return tokens
 
 
+def _add_token(tokens: set[str], word: str) -> None:
+    if len(word) < 3:
+        return
+    if word in _STOP_WORDS:
+        return
+    tokens.add(_stem(word))
+
+
 def _memory_matches(memory_content: str, normalized_response: str) -> bool:
     """Return True if enough of the memory's tokens appear in the response.