What changed in the topic-model pipeline, what failed in the early runs, and why the current scores should be treated as the post-rescue benchmark

This note exists so topic-model evaluation pages can link to one stable explanation instead of repeating fragments of the story. The short version is that we kept the honest early failures, rebuilt the evaluation path so topics are trained and projected split-safely, tuned within train-only pools, removed a sentence-splitting shortcut that made the task artificially easy, and repaired Top2Vec so held-out rows are projected into the learned topic space rather than quietly altering the fitted model.

source · walkthrough/topic_model_evaluation_implementation_note.html

Failure Timeline

We did not arrive at the current topic-model results by a smooth, monotone improvement. The important part of the implementation story is that the early runs exposed real methodological and systems problems, and those failures shaped the final design.

1. Overly easy preprocessing
Some early topic-model runs used sentence-level splitting, which gave the topic models unnaturally simple inputs and made the comparison less faithful to the real evaluation task. That preprocessing was removed before the rescue runs.

2. Weak or broken first-pass results
Top2Vec looked especially unstable. On Druglib rating, the old full-run strict holdout was -0.2012 and the old permissive holdout collapsed to -1.2708. Those results were kept as failures rather than quietly overwritten.

3. Split-safety risk
The evaluation goal was always to fit topic spaces inside the training data and then project validation or holdout rows into those learned spaces. Any path that fit on all rows, or that smuggled held-out information back into the topic model, would break the PsyProxy validation logic.

4. Stalled reruns and degenerate matrices
Several topic reruns either stalled for hours or pushed low-information topic matrices into downstream ACE and regression steps. That led to explicit guardrails for degenerate columns, stronger benchmarking checks, and willingness to kill stalled jobs rather than pretending they were still productive.

5. Repair and transfer testing
After the runner, projection path, and tuning loop were repaired, Top2Vec moved from negative Druglib rating performance to a permissive holdout of 0.2233, which is competitive with strong non-PsyProxy baselines and much closer to the best PsyProxy lane.

Why keep the failures visible? Because the final topic-model results only make sense if rea
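The guardrail for degenerate columns mentioned in the timeline can be pictured with a small sketch. This is not the code in evaluate.py or ace.py. The function name `drop_degenerate_columns`, the `min_variance` threshold, and the definition of "degenerate" (a non-finite value or near-zero variance) are all assumptions made for illustration.

```python
import math
from statistics import pvariance


def drop_degenerate_columns(matrix, min_variance=1e-12):
    """Hypothetical guardrail: drop topic columns that carry no usable signal.

    A column is treated as degenerate (an assumption of this sketch) when it
    contains a non-finite value or its population variance is at or below
    `min_variance`. Returns the filtered matrix and the kept column indices.
    """
    if not matrix:
        return [], []
    kept = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        # Check finiteness first so pvariance never sees NaN or inf.
        if all(math.isfinite(v) for v in column) and pvariance(column) > min_variance:
            kept.append(j)
    filtered = [[row[j] for j in kept] for row in matrix]
    return filtered, kept


# Toy document-by-topic matrix: the middle column is constant.
topic_matrix = [
    [0.2, 0.0, 0.7],
    [0.5, 0.0, 0.1],
    [0.9, 0.0, 0.4],
]
filtered, kept = drop_degenerate_columns(topic_matrix)
print(kept)  # indices of the columns that survive the filter
```

Returning the kept indices alongside the filtered matrix lets a caller apply the same column selection to validation and holdout matrices, so the filter itself is decided on training rows only.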

What Was Repaired

The rescue work was not a single hyperparameter tweak. It was a coordinated rebuild of preprocessing, tuning, transformation, guardrails, and reproducibility assumptions.

Validation discipline
Topic tuning was moved into a train-only tuning pool. The tuning step is performed once per dataset and algorithm, not inside every validation fold, and the untouched holdout remains untouched until final scoring. This keeps the topic package aligned with the same PsyProxy validation logic used elsewhere.

Algorithm-specific preprocessing
Sentence splitting was removed. Stopword handling was kept algorithm-specific rather than borrowing one algorithm's recommendations for another. This mattered because one-size-fits-all text cleaning had already distorted the comparison.

Top2Vec held-out projection
The major Top2Vec repair was replacing the old transform path that relied on mutating the fitted model with add_documents(). Held-out rows are now embedded directly and projected against the learned topic vectors, which matches the intent of split-safe evaluation.

Top2Vec reproducibility
UMAP seeding is now passed through explicitly so repeated runs are less at the mercy of stochastic dimensionality reduction. This did not make Top2Vec perfect, but it removed a major source of false confidence and false regressions.

Search space widening
The rescue grid was widened beyond the early MiniLM-only defaults to include no-ngram leaf settings, higher dimensional leaf settings, and doc2vec families. Those candidate families were the ones that actually produced post-fix signal once the projection path was honest.

Guardrails downstream
Degenerate topic columns are now filtered before ACE and final model fitting. Earlier stalled or broken reruns showed that topic models can emit matrices that look
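To illustrate the held-out projection idea, here is a minimal sketch, not the code in runner.py. It assumes the topic vectors were learned from training rows only and that held-out documents have already been embedded into the same space by the same embedding model. It uses cosine similarity as the projection, which is an assumption of this sketch, and the names `cosine` and `project_onto_topics` are hypothetical. The point it shows is that projection reads the fitted topic vectors without changing them.

```python
import copy
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def project_onto_topics(doc_embeddings, topic_vectors):
    """Hypothetical projection: one similarity score per (document, topic) pair.

    The topic vectors are only read, never modified, so scoring held-out
    rows cannot alter the fitted topic space.
    """
    return [[cosine(doc, topic) for topic in topic_vectors] for doc in doc_embeddings]


# Toy 2-D example: two topic vectors learned on training data (assumed).
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
snapshot = copy.deepcopy(topic_vectors)

# Held-out rows, already embedded with the same (assumed) embedding model.
holdout_embeddings = [[1.0, 0.0], [1.0, 1.0]]
features = project_onto_topics(holdout_embeddings, topic_vectors)

# The fitted topic space is unchanged after projection.
assert topic_vectors == snapshot
print(features)
```

The resulting rows play the role of the document-by-topic feature matrix for held-out data. A path that added documents to the model instead would change the fitted state between scoring calls, which is the risk the repair removed.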

Result Snapshot

These numbers are included to show the difference between the failed early Top2Vec state and the repaired post-rescue state. They are not a substitute for the live evaluation tables; they are the implementation story behind those tables.

| System state | Dataset | Variant | Holdout | Interpretation |
| --- | --- | --- | --- | --- |
| Early Top2Vec full run | Druglib rating | strict | -0.2012 | Method failure signal, not deployment-ready. |
| Early Top2Vec full run | Druglib rating | permissive | -1.2708 | Clear sign that the initial path could not be trusted. |
| Repaired Top2Vec rescue | Druglib rating | strict | 0.1832 | Beats many classic text-feature baselines. |
| Repaired Top2Vec rescue | Druglib rating | permissive | 0.2233 | Competitive with strong saved baseline systems and close to PsyProxy. |
| Repaired Top2Vec full Disney candidate | Disney rating | permissive | 0.1128 | A real improvement over the broken state, but still weaker than Tomotopy. |
| Tuned Tomotopy HDP | Disney rating | strict/permissive | 0.2535 | Proof that the topic pipeline can become competitive when the fit is stable. |
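As a quick worked check on the Druglib rows, the size of the change from the early Top2Vec run to the repaired rescue can be computed directly from the reported holdout scores. The dictionary names below are illustrative only. The scores are the ones quoted in this note.

```python
# Holdout scores for Top2Vec on Druglib rating, as reported in this note.
early = {"strict": -0.2012, "permissive": -1.2708}
repaired = {"strict": 0.1832, "permissive": 0.2233}

# Absolute change per variant, rounded to the precision of the source figures.
gains = {variant: round(repaired[variant] - early[variant], 4) for variant in early}

for variant, gain in gains.items():
    print(f"{variant}: {early[variant]:+.4f} -> {repaired[variant]:+.4f} (change {gain:+.4f})")
```

The strict variant moves by 0.3844 and the permissive variant by 1.4941, so most of the apparent recovery comes from the permissive path, which was also the one that had collapsed furthest.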

Implementation Footprint

These are the main files behind the rescue. They should travel together when the criterion-comparison code is committed, because the note is only honest if it points to the real implementation changes.

Core topic-model files
packages/psyproxy-topic-server/src/psyproxy_topic_server/top2vec/runner.py
packages/psyproxy-topic-server/src/psyproxy_topic_server/presets.py
scripts/tune_top2vec_rescue.py
packages/psyproxy-topic-server/tests/test_runner.py
packages/psyproxy-topic-server/tests/test_presets.py

Evaluation guardrail files
packages/psyproxy-distributable/src/psyproxy/pipeline/evaluate.py
packages/psyproxy-distributable/src/psyproxy/ranking/ace.py
packages/psyproxy-distributable/src/psyproxy/models/implementations.py
scripts/run_topic_package_benchmark.py
presentation/walkthrough/topic_model_evaluation_implementation_note.html

Tests and validation checks
The runner and preset work now has direct test coverage, and the guardrail patch set previously cleared its targeted test bundle. The point is not that the topic stack is perfect, but that the rescue moved it out of undocumented trial-and-error territory and into something that can be audited, transferred, and rerun.

What should happen next

The next stage is not another quiet local experiment. It is a criterion-wide rerun using the repaired topic stack, with this note linked anywhere those topic-model scores are displayed.

Link the note
Every topic-model evaluation page should point to this note with the single label and explanation shown above. The link is the reader-facing guardrail that tells people these numbers were rescued, not just regenerated.

Run the full criterion set
Top2Vec and Tomotopy should be rerun on the criterion datasets with the repaired split-safe method and the current tuned defaults. The output tables should distinguish legacy topic-model runs from the post-rescue runs until the legacy rows are fully superseded.

Commit the implementation
The rescue is substantial enough that it should be committed into the criterion-comparison system repository rather than left in local state. That commit should include both the code changes and this note so the rationale travels with the implementation.

Prepared as the single canonical explanation for the topic-model rescue effort. If the evaluation tables are updated again, this note can be revised, but the existence of early failures should remain visible.

Suggested Link For Topic-Model Pages
