
Takara TLDR is our daily research digest: a curated shortlist of the most relevant new papers across arXiv’s CS categories.
Under the hood, we run an end‑to‑end curation pipeline that starts from thousands of fresh papers and produces a final set of 50 in under 30 seconds.
The key enabler is DS1, Takara.ai’s CPU‑first embedding API built for low‑latency, real‑time applications.
What’s inside
the end‑to‑end curation pipeline (collect → embed → score → diversify → publish)
how we encode “taste” without brittle keyword rules
how we keep the list diverse without losing relevance
the practical choices that keep runtime measured in seconds
The pipeline at a glance
1. Collect the day’s arXiv CS papers (AI/LG/CL/CV).
2. Embed each paper (title + abstract) into a vector using DS1.
3. Score each vector against a curated reference set (our taste, represented as vectors) to get a relevance score.
4. Shortlist the top ~500 by score.
5. Diversify with MMR (Maximal Marginal Relevance) to select a final 50 that are both relevant and non‑redundant.
6. Publish to RSS and/or downstream summarisation.

How we encode “taste” without rules
Keyword filters are typically fragile: they miss novel phrasing, they overfit to trends, and they are painful to maintain. Instead, we treat taste as data.
We maintain a curated reference set of papers that reflect what we consistently value. That set becomes a vector namespace, and every new paper is scored by semantic similarity to it.
This makes the system:
opinionated by design (the reference set is the editorial anchor)
robust (semantic similarity beats synonyms and buzzwords)
iterable (improving the curated set improves the output, without rewriting logic)
Embed: DS1 turns text into vectors fast (on CPU)
Each paper is represented as:
Text: title + abstract
Vector: DS1 embedding
DS1 matters because we embed everything before we know what is worth keeping; doing that economically and quickly requires a CPU‑native path (no GPU scheduling overhead) and consistently low per‑request latency. By making embeddings cheap and low‑latency, we can keep recall high and still run daily at scale.
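The batching pattern is simple enough to sketch. Note that `embed_fn` below is a generic stand‑in for the DS1 client, and its signature (list of strings in, one vector per string out) is an assumption for illustration, not the real API:

```python
import numpy as np

def embed_batched(texts, embed_fn, batch_size=256):
    """Embed texts in fixed-size batches rather than one request per paper.

    `embed_fn` is a hypothetical stand-in for the DS1 client: it takes a
    list of strings and returns one vector per string.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return np.array(vectors)

# Dummy embedder (hash-based 8-dim vectors) just to exercise the batching loop.
def fake_embed(batch):
    return [[float((hash(t) >> s) % 7) for s in range(8)] for t in batch]

papers = [f"title + abstract {i}" for i in range(1000)]
vecs = embed_batched(papers, fake_embed, batch_size=256)
```

Batching is what keeps per‑request overhead from dominating when the input is thousands of papers.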
Score: k‑NN against the curated reference set
For each paper vector:
run a k‑nearest neighbours query against the curated reference vectors
convert neighbour distance into similarity
aggregate similarity (typically the mean of the top‑k similarities) into a single relevance score
Intuition: papers that “feel like” what we already curate should rank higher, even if the topic is new or the wording is unfamiliar.
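The scoring step above can be sketched in a few lines of NumPy. This is a minimal brute‑force version (a real deployment would use an approximate nearest‑neighbour index), and the toy vectors at the end are illustrative only:

```python
import numpy as np

def relevance_scores(paper_vecs, reference_vecs, k=5):
    """Score each paper by the mean cosine similarity to its k nearest
    curated reference vectors."""
    # L2-normalise so a plain dot product is cosine similarity.
    p = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    r = reference_vecs / np.linalg.norm(reference_vecs, axis=1, keepdims=True)
    sims = p @ r.T                         # (n_papers, n_refs)
    topk = np.sort(sims, axis=1)[:, -k:]   # k highest similarities per paper
    return topk.mean(axis=1)               # one relevance score per paper

# Toy 2-D example: the first paper points the same way as the references.
refs = np.array([[1.0, 0.0], [0.8, 0.6]])
papers = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = relevance_scores(papers, refs, k=2)
```

Averaging the top‑k similarities (rather than taking only the single nearest neighbour) smooths out one‑off coincidental matches.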

Shortlist: keep the best ~500
After scoring, we keep a bounded candidate pool (typically the top ~500). This is a practical sweet spot:
big enough to preserve breadth and catch emerging themes
small enough to make the diversity step fast and predictable
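Selecting the pool doesn’t need a full sort; a partial partition is enough. A minimal sketch:

```python
import numpy as np

def shortlist(scores, pool_size=500):
    """Return indices of the top-`pool_size` papers by relevance score.

    np.argpartition finds the top pool in O(n), versus O(n log n)
    for sorting the whole corpus.
    """
    pool_size = min(pool_size, len(scores))
    idx = np.argpartition(scores, -pool_size)[-pool_size:]
    # Order the pool best-first for the diversity step that follows.
    return idx[np.argsort(scores[idx])[::-1]]

pool = shortlist(np.arange(10.0), pool_size=3)
```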
Diversify: MMR picks 50 without duplicates
If you simply take the top 50 by score, you often get clusters of near‑duplicates: the same benchmark, the same method family, small variants.
We use Maximal Marginal Relevance (MMR) to balance:
relevance: the curated‑set similarity score
novelty: penalising candidates that are too similar to already‑selected papers
MMR selects papers iteratively, each time choosing the best trade‑off between “high score” and “not a duplicate of what we already have”.
Concrete example: if the top ten items are all “GPT‑4‑class reasoning” benchmark papers with very similar abstracts, MMR will tend to keep the strongest one or two and then let through a slightly lower‑scored paper in a different area (e.g., VLA (vision‑language‑action) robotics, compiler optimisation, or safety evaluation) so the final list covers more ground.
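The greedy MMR loop can be sketched directly. The trade‑off parameter `lam` and the seed‑with‑top‑score choice here are illustrative defaults, not necessarily the production values; the toy vectors at the end mimic the near‑duplicate scenario above:

```python
import numpy as np

def mmr_select(candidate_vecs, relevance, n_select=50, lam=0.7):
    """Greedy Maximal Marginal Relevance over a candidate pool.

    lam = 1.0 is pure relevance; lam = 0.0 is pure novelty.
    """
    v = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    selected = [int(np.argmax(relevance))]   # seed with the top-scored paper
    remaining = set(range(len(v))) - set(selected)
    while remaining and len(selected) < n_select:
        rem = np.array(sorted(remaining))
        # Redundancy = max similarity to anything already selected.
        redundancy = (v[rem] @ v[selected].T).max(axis=1)
        mmr = lam * relevance[rem] - (1 - lam) * redundancy
        best = int(rem[np.argmax(mmr)])
        selected.append(best)
        remaining.remove(best)
    return selected

# Papers 0 and 1 are near-duplicates; paper 2 is a different, lower-scored topic.
vecs = np.array([[1.0, 0.0], [0.999, 0.04], [0.0, 1.0]])
rel = np.array([1.0, 0.98, 0.6])
picked = mmr_select(vecs, rel, n_select=2)
```

Here MMR keeps the strongest near‑duplicate and then skips its clone in favour of the different topic, exactly the behaviour described above.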

How this runs in under 30 seconds
The speed comes from keeping the system throughput‑oriented and bounded:
Batch embeddings: DS1 processes the day’s papers in bulk rather than one at a time, amortising per‑request overhead.
Parallel scoring: similarity queries run concurrently with controlled parallelism.
Small neighbour sets: scoring uses a small k to estimate “fit” quickly.
Fixed-size diversity step: MMR runs on a capped pool rather than the full corpus.
The result is an end‑to‑end daily run measured in seconds, even when the input is thousands of papers.
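The bounded‑parallelism pattern for the scoring step looks roughly like this. `score_fn` is a hypothetical stand‑in for the k‑NN query against the reference index, and the worker count is an illustrative default:

```python
from concurrent.futures import ThreadPoolExecutor

def score_concurrently(paper_batches, score_fn, max_workers=8):
    """Run similarity queries for batches of papers with bounded parallelism.

    `score_fn` stands in for a k-NN query against the reference index;
    ThreadPoolExecutor caps concurrency so the index is never overwhelmed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with batches.
        results = list(pool.map(score_fn, paper_batches))
    return [s for batch in results for s in batch]

# Stand-in scorer: pretend each "paper" already maps to a numeric score.
batches = [[1, 2], [3, 4], [5]]
scores = score_concurrently(batches, lambda b: [x * 0.1 for x in b])
```

Capping `max_workers` is what makes the run time predictable: throughput scales with parallelism, but never past what the scoring backend can absorb.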
How we keep quality and “vibe” consistent
The curated reference set is the control surface. To steer TLDR over time, we evolve that set via a lightweight editorial loop:
Add exemplars when a genuinely great paper appears in an underrepresented area.
Remove or de‑emphasise patterns that repeatedly generate low‑value recommendations.
Maintain coverage by ensuring the reference set reflects multiple sub‑domains, not a single trend.
In practice, we run this as a periodic review (typically every few weeks, or when the digest drifts): look at misses, look at duplicate clusters, adjust the reference set, and rerun. MMR then acts as the daily guardrail that keeps the output varied even when the input stream is trend-heavy.
What we learned in production
No curation system is set‑and‑forget; the most useful learnings came from watching where the pipeline fails and tightening the loop. The main failure modes (and why this design helps) are:
Topic collapse: a hot trend floods the top scores → MMR pushes back by penalising redundancy.
Reference set drift: the curated set stops reflecting what you want → updating exemplars restores alignment.
Over‑diversification: too much novelty at the expense of relevance → tune MMR’s trade‑off toward relevance.
Cold‑start themes: new topics that aren’t represented yet → add a small number of strong exemplars to the reference set.
Thoughts from me!
I've been working on TLDR in my spare time for over a year now. It serves over 5,000 dedicated users, and it's by far the best way I know to discover new research in this crazy world. I'm extremely happy with where TLDR is today, even though I've got a lot more in store for it. Without DS1 it would be far too expensive to run, and as you can see here, I've spent a lot of effort tuning and trialling the process to deliver the best research to my friends and all our customers!
Thanks!