Sommarie
An extractive summariser for news articles, built with classical natural language processing.
Overview
What was built
Extractive text summariser for news articles. Scores every sentence by word frequency, selects the top-N highest-scoring sentences, and returns them in original document order.
Why it was built
Extractive summarisation is a well-scoped NLP problem with a clear correctness criterion: every sentence in the output must appear verbatim in the source. It is a practical exercise in building a clean text-processing pipeline from classical algorithms, with no deep learning required.
Architecture & Design
A linear Python pipeline. Raw article text flows through preprocessing (tokenisation, normalisation), frequency scoring (word weights computed across the full document), sentence scoring (each sentence scored by its token weights), and top-N selection (highest-scored sentences returned in document order).
Input: Raw news article (plain string or file)
Output: Condensed extractive summary
Preprocessing: Sentence and word tokenisation, stop word removal, normalisation
Frequency scoring: TF-IDF-style word frequency computation across the full document
Sentence scoring: Scores each sentence by the sum of its token weights, adjusted by a position penalty
Top-N selection: Picks the highest-scoring sentences and returns them in document order
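The frequency-scoring and sentence-scoring stages can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `STOP_WORDS`, `tokenise_words`, `word_weights`, and `score_sentence` are hypothetical, the stop word list is deliberately tiny, and the weights are plain normalised term frequencies, because the write-up does not say how the "TF-IDF-style" weighting is computed.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "that", "it"}


def tokenise_words(sentence):
    """Lowercase a sentence and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def word_weights(sentences):
    """Weight each word by its count relative to the most frequent word in the document."""
    counts = Counter(w for s in sentences for w in tokenise_words(s))
    if not counts:
        return {}
    top = max(counts.values())
    return {w: c / top for w, c in counts.items()}


def score_sentence(sentence, weights):
    """Raw sentence score: the sum of the weights of its non-stop-word tokens."""
    return sum(weights.get(w, 0.0) for w in tokenise_words(sentence))
```

Because the raw score is a sum, longer sentences tend to score higher; dividing by token count would be one way to counter that, at the cost of favouring very short sentences.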
Challenges & Solutions
Problem
Long news articles contain filler and repetition, and readers want the key facts without reading the whole piece. An extractive summariser must identify the most informative sentences, but naive frequency scoring over-weights the opening paragraph, which introduces topic vocabulary at high density.
Constraints
- Extractive only: output sentences must appear verbatim in the source
- Must handle varied article structures without manual formatting
- No deep learning model; classical NLP pipeline only
Approach
TF-IDF-style word frequency pipeline: tokenise into sentences and words, remove stop words, compute word weights, score each sentence by the sum of its word weights, apply a position penalty to discount early sentences, then select the top-N scoring sentences. N is a configurable ratio of total sentence count so the summary scales with article length. Sentences are returned in their original document order to maintain narrative coherence.
Challenge: Articles with subheadings, lists, and mixed formatting broke naive sentence tokenisation, producing sentence fragments and false boundaries.
Solution: Added normalisation rules before tokenisation: strip markup artefacts, join continuation lines, remove non-sentence list items.
Result: Clean sentence boundaries across the range of article formats tested.
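The three normalisation rules above might look something like the sketch below. The function name `normalise` and the specific patterns (HTML-like tags as "markup artefacts", bullet or numbered lines without closing punctuation as "non-sentence list items") are assumptions chosen for illustration; the actual rules are not given in this write-up.

```python
import re


def normalise(text):
    """Clean raw article text before sentence tokenisation (illustrative rules only)."""
    # Rule 1: strip markup artefacts, assumed here to be HTML-like tags.
    text = re.sub(r"<[^>]+>", " ", text)

    kept = []
    for line in (ln.strip() for ln in text.splitlines()):
        # Rule 3: drop bullet or numbered items that do not end like a sentence.
        is_list_item = re.match(r"^([-*•]|\d+[.)])\s+", line)
        if is_list_item and not re.search(r"[.!?]$", line):
            continue
        kept.append(line)  # blank lines are kept as paragraph breaks

    # Rule 2: join continuation lines; consecutive non-blank lines form one paragraph.
    paragraphs, current = [], []
    for line in kept:
        if line:
            current.append(line)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))

    # Collapse any leftover runs of whitespace inside each paragraph.
    return "\n".join(re.sub(r"\s+", " ", p).strip() for p in paragraphs)
```

Note that this sketch does not treat subheadings specially: a heading that is not followed by a blank line would be joined onto the paragraph after it.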
Challenge: The opening paragraph consistently dominated summaries because lede sentences introduce topic vocabulary at high density, inflating their frequency scores.
Solution: Applied a position penalty: each sentence's score is multiplied by a factor based on its index that discounts early sentences. Opening sentences therefore need a proportionally higher raw score to be selected.
Result: Summaries draw from throughout the article rather than concentrating in the first few sentences.
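One possible shape for a position penalty that discounts early sentences is sketched below. The linear ramp, the `floor` and `ramp` parameters, and both function names are assumptions; the write-up does not specify the decay function or its constants.

```python
def position_factor(index, floor=0.6, ramp=5):
    """Multiplier that rises linearly from `floor` at index 0 to 1.0 at index `ramp`."""
    if ramp <= 0:
        return 1.0
    return floor + (1.0 - floor) * min(index, ramp) / ramp


def apply_position_penalty(raw_scores, floor=0.6, ramp=5):
    """Scale each raw sentence score by the factor for its position in the document."""
    return [score * position_factor(i, floor, ramp) for i, score in enumerate(raw_scores)]
```

With these illustrative defaults, the first sentence keeps 60% of its raw score and sentences from index 5 onward are left unchanged, so a lede sentence is still selected if its raw score is high enough.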
Challenge: A fixed N for summary length produced over-long summaries for short articles and too-brief ones for long articles.
Solution: Made N a configurable ratio of total sentence count (default 30%). The caller can tune the ratio; the algorithm adapts automatically.
Result: Summary length scales proportionally with article length across the full test corpus.
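Ratio-based selection with document-order output can be sketched as below. The function name `select_top_ratio`, rounding up, and the minimum of one sentence are assumptions; only the 30% default and the document-order requirement come from the text above.

```python
import math


def select_top_ratio(sentences, scores, ratio=0.3):
    """Return the highest-scoring share of sentences, restored to document order."""
    if not sentences:
        return []
    # Assumption: round up and always keep at least one sentence.
    n = max(1, min(len(sentences), math.ceil(ratio * len(sentences))))
    # Rank sentence indices by score, best first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Sort the chosen indices so the summary follows the source's narrative order.
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]
```

For a 10-sentence article the default ratio keeps 3 sentences; for a 4-sentence article it keeps 2.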
Extractive over abstractive summarisation
Rationale: Extractive methods are deterministic and cannot hallucinate, because they only return sentences that exist in the source. Abstractive generation typically requires a trained neural model such as a large language model, introduces factual risk, and was out of scope for a classical NLP project.
Trade-off: Extractive summaries can feel choppy when the top-scoring sentences don't form a natural narrative. Abstractive summaries read more naturally but require substantially more infrastructure.
Frequency-based scoring over graph-based TextRank
Rationale: Frequency scoring is simpler to implement, tune, and reason about. TextRank was the obvious next step but wasn't needed to produce useful summaries at this stage.
Trade-off: TextRank considers inter-sentence similarity and centrality, so it would select more thematically representative sentences. Frequency scoring is noisier on topic-dense openers.
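For comparison, the graph-based alternative could be sketched as follows. This is not part of the project as built: it is a compact illustration of TextRank-style scoring, using word-overlap similarity normalised by log sentence lengths and a fixed number of power iterations. The function names and the `damping` and `iterations` defaults are assumptions.

```python
import math
import re


def _tokens(sentence):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def _similarity(a, b):
    """Shared-word count normalised by the log lengths of both token sets."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality in a weighted sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    sets = [_tokens(s) for s in sentences]
    sim = [[_similarity(sets[i], sets[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each sentence
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbours' scores,
        # proportional to the similarity between them.
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

A sentence that shares no words with any other receives only the baseline `(1 - damping) / n`, which is how centrality, rather than raw vocabulary density, drives the ranking.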
Ratio-based N over a fixed sentence count
Rationale: A fixed count produces wildly different compression ratios depending on article length. A ratio produces consistent compression across the article length distribution.
Trade-off: The ratio must still be tuned per domain, as news articles, academic papers, and social media posts need different default ratios.
Results & Learnings
- Coherent extractive summaries across varied news article formats
- Summary length scales proportionally with article length
- Position bias measurably reduced; summaries now draw from throughout the article
- Runs on CPU with no external model dependencies
Key Lessons
- Preprocessing quality is the primary determinant of output quality; tokenisation errors propagate through every downstream step
- Simple frequency scoring is effective for well-structured news articles; the main failure mode is positional bias, not the scoring algorithm itself
- A ratio-based N handles variable-length inputs better than a fixed count, although the ratio itself still needs tuning per domain
Future Improvements
- Implement TextRank and compare ROUGE scores against frequency scoring on a held-out test corpus
- Add a web interface for paste-and-summarise use cases
- Experiment with transformer-based abstractive summarisation as a comparison baseline
What I'd Do Differently
- Build a ROUGE evaluation harness at the start; measuring quality objectively from the beginning changes what improvements you prioritise
- Collect a structured test corpus before writing code, not after
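A ROUGE evaluation harness could start from something as small as the sketch below. It computes a simplified ROUGE-1 (unigram overlap) between a generated summary and a reference summary. The function name `rouge1` and the tokenisation are assumptions, and it omits the stemming and multi-reference handling that established ROUGE tooling offers, so its numbers would not be directly comparable to published scores.

```python
import re
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: unigram precision, recall, and F1 against one reference."""
    cand = Counter(re.findall(r"[a-z0-9']+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9']+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running this over a fixed set of article and reference-summary pairs after each change would give a consistent signal on whether an adjustment, such as the position penalty, actually helps.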