Completed

Sommarie

An extractive summariser for news articles, built with classical natural language processing.

Python, Text-Summarization, News

// overview

Overview

What was built

Extractive text summariser for news articles. Scores every sentence by word frequency, selects the top-N highest-scoring sentences, and returns them in original document order.

Why it was built

Extractive summarisation is a well-scoped NLP problem with a clear correctness criterion: every sentence in the output must appear verbatim in the source. It's a practical exercise in building a clean text processing pipeline with classical algorithms and no deep learning.

Key features

✦ Extractive: verbatim source sentences only

✦ TF-IDF-style word frequency scoring

✦ Position-aware penalty to reduce lede bias

✦ Configurable summary ratio (default 30%)

✦ Handles varied article structures

✦ CPU-only with no GPU or deep learning dependencies

// architecture

Architecture & Design

A linear Python pipeline. Raw article text flows through preprocessing (tokenisation, normalisation), frequency scoring (word weights computed across the full document), sentence scoring (each sentence scored by its token weights), and top-N selection (highest-scored sentences returned in document order).

Client

Article Input: raw news article (plain string or file) → Preprocessor (raw text)

Summary Output: condensed extractive summary

Service Layer

Preprocessor: sentence and word tokenisation, stop word removal, normalisation → Frequency Scorer (clean tokens)

Frequency Scorer: TF-IDF-style word frequency computation across the full document → Sentence Scorer (word weights)

Sentence Scorer: scores each sentence by the sum of its token weights, adjusted by the position penalty → Top-N Selector (scored sentences)

Top-N Selector: picks the highest-scoring sentences and returns them in document order → Summary Output (top-N in order)
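The four stages above can be sketched end to end in a few functions. This is a minimal illustration, not the project's actual code: the function names, the tiny stop word list, the regex-based sentence splitter, and the choice to normalise counts by the most frequent word are all assumptions, and the position penalty is left out to keep the flow easy to follow.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; a real list would be longer).
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "by", "for", "in", "is",
              "it", "of", "on", "or", "that", "the", "to", "was", "were", "with"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(sentence: str) -> list[str]:
    """Lowercased word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower())
            if w not in STOP_WORDS]


def word_weights(sentences: list[str]) -> dict[str, float]:
    """Document-wide word counts, normalised by the most frequent word."""
    counts = Counter(w for s in sentences for w in content_words(s))
    if not counts:
        return {}
    highest = max(counts.values())
    return {w: c / highest for w, c in counts.items()}


def summarise(text: str, ratio: float = 0.3) -> list[str]:
    """Return the top-scoring sentences, verbatim and in document order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    weights = word_weights(sentences)
    scores = [sum(weights.get(w, 0.0) for w in content_words(s))
              for s in sentences]
    n = max(1, round(ratio * len(sentences)))
    # Rank sentence indices by score, keep the top n, then restore order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

Because the selector returns slices of the original sentence list, the output can only contain sentences that exist in the input, which is the extractive guarantee the design relies on.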

// challenges & solutions

Challenges & Solutions

Problem

Long news articles contain filler and repetition. Readers want the key facts without the full read. An extractive summariser must identify the most informative sentences, but naive frequency scoring over-weights the opening paragraph, which introduces topic vocabulary at high density.

Constraints

Extractive only: output sentences must appear verbatim in the source
Must handle varied article structures without manual formatting
No deep learning model; classical NLP pipeline only

Approach

TF-IDF-style word frequency pipeline: tokenise into sentences and words, remove stop words, compute word weights, score each sentence by the sum of its word weights, apply a position penalty to discount early sentences, then select the top-N scoring sentences. N is a configurable ratio of total sentence count so the summary scales with article length. Sentences are returned in their original document order to maintain narrative coherence.
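In symbols, the approach can be written roughly as follows. The exact weighting and penalty forms are not given here, so this is one plausible reading: $f(t)$ is the count of word $t$ in the document after stop word removal, $p(i)$ is a position factor in $(0, 1]$ that is smallest for early sentence indices $i$, $S$ is the list of sentences, and $r$ is the summary ratio (default 0.3). Normalising by the maximum count and rounding to the nearest integer are assumptions.

```latex
\[
w(t) = \frac{f(t)}{\max_{t'} f(t')}, \qquad
\operatorname{score}(s_i) = p(i) \sum_{t \in s_i} w(t), \qquad
N = \max\bigl(1, \operatorname{round}(r \cdot |S|)\bigr)
\]
```

The summary is then the $N$ sentences with the highest $\operatorname{score}(s_i)$, emitted in increasing order of $i$.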

Technical challenges

Challenge

Articles with subheadings, lists, and mixed formatting broke naive sentence tokenisation, producing sentence fragments and false boundaries.

Solution

Added normalisation rules before tokenisation: strip markup artefacts, join continuation lines, remove non-sentence list items.
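A rough sketch of what such rules might look like is given below. The regular expressions and the `normalise` name are hypothetical; the sketch covers only the three rules named above (markup stripping, joining wrapped lines, dropping list items that are not sentences) and does nothing special for subheadings.

```python
import re

_TAG = re.compile(r"<[^>]+>")                        # crude HTML-like tag matcher
_BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")     # list markers: -, *, •, 1., 2)
_SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")        # ends like a sentence


def normalise(raw: str) -> str:
    """Strip tags, join wrapped lines into paragraphs, and drop list
    items that do not end like a sentence. One paragraph per output line."""
    paragraphs: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Close the paragraph being accumulated, if any.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in raw.splitlines():
        line = _TAG.sub("", line).strip()
        if not line:
            flush()  # a blank line ends the current paragraph
            continue
        if _BULLET.match(line):
            flush()
            item = _BULLET.sub("", line)
            if _SENTENCE_END.search(item):
                paragraphs.append(item)  # keep items that read as sentences
            continue
        current.append(line)  # continuation of the current paragraph
    flush()
    return "\n".join(paragraphs)
```

Running the sentence tokeniser on the normalised text rather than the raw input means a hard line wrap is no longer mistaken for a sentence boundary.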

Outcome

Clean sentence boundaries across the range of article formats tested.

Challenge

The opening paragraph consistently dominated summaries because lede sentences introduce topic vocabulary at high density, inflating their frequency scores.

Solution

Applied a position penalty: each sentence's score is multiplied by a factor based on its index, with the strongest discount on the opening sentences. Early sentences need a proportionally higher raw score to be selected.
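One possible shape for such a penalty is sketched below, assuming (as the Approach section states) that it discounts early sentences. The linear ramp, the `floor` and `ramp` parameters, and their default values are hypothetical choices for illustration; the functional form used in the project is not specified.

```python
def position_factor(index: int, floor: float = 0.5, ramp: int = 5) -> float:
    """Multiplier in [floor, 1.0]: smallest for the first sentence, rising
    linearly to 1.0 at sentence index `ramp` and staying at 1.0 afterwards."""
    if ramp <= 0:
        return 1.0  # penalty disabled
    progress = min(index, ramp) / ramp
    return floor + (1.0 - floor) * progress


def apply_position_penalty(raw_scores: list[float],
                           floor: float = 0.5,
                           ramp: int = 5) -> list[float]:
    """Scale each raw sentence score by the factor for its position."""
    return [score * position_factor(i, floor, ramp)
            for i, score in enumerate(raw_scores)]
```

With these defaults, an opening sentence needs twice the raw score of a sentence at index 5 or later to rank level with it.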

Outcome

Summaries draw from throughout the article rather than concentrating in the first few sentences.

Challenge

A fixed N for summary length produced over-long summaries for short articles and too-brief ones for long articles.

Solution

Made N a configurable ratio of total sentence count (default 30%). The caller can tune the ratio, and N is recomputed from each article's sentence count.
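A small sketch of the calculation, with a hypothetical function name. Rounding to the nearest integer and keeping at least one sentence are assumptions about edge-case handling that are not spelled out here.

```python
def summary_length(sentence_count: int, ratio: float = 0.3) -> int:
    """Number of sentences to keep for an article of the given length."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if sentence_count <= 0:
        return 0
    # Keep at least one sentence and never more than the article has.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return min(sentence_count, max(1, round(ratio * sentence_count)))
```

For example, a 10-sentence article yields 3 sentences and a 40-sentence article yields 12, whereas a fixed N of 5 would keep half of the first article and only an eighth of the second.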

Outcome

Summary length scales proportionally with article length across the full test corpus.

Engineering decisions

Extractive over abstractive summarisation

Rationale

Extractive methods are deterministic and hallucination-free, as they can only return sentences that exist in the source. Abstractive generation typically requires a neural language model, introduces factual risk, and was out of scope for a classical NLP project.

Tradeoffs

Extractive summaries can feel choppy when the top-scoring sentences don't form a natural narrative. Abstractive summaries read more naturally but require substantially more infrastructure.

Frequency-based scoring over graph-based TextRank

Rationale

Frequency scoring is simpler to implement, tune, and reason about. TextRank was the obvious next step but wasn't needed to produce useful summaries at this stage.

Tradeoffs

TextRank considers inter-sentence similarity and centrality, so it would select more thematically representative sentences. Frequency scoring is noisier on topic-dense openers.

Ratio-based N over a fixed sentence count

Rationale

A fixed count produces wildly different compression ratios depending on article length. A ratio produces consistent compression across the article length distribution.

Tradeoffs

The ratio must still be tuned per domain, as news articles, academic papers, and social media posts need different default ratios.

// results & learnings

Results & Learnings

Outcomes

Coherent extractive summaries across varied news article formats

Summary length scales proportionally with article length

Position bias measurably reduced; summaries now draw from throughout the article

Runs on CPU with no external model dependencies

Learnings

Key Lessons

Preprocessing quality is the primary determinant of output quality; tokenisation errors propagate through every downstream step
Simple frequency scoring is effective for well-structured news articles; the main failure mode is positional bias, not the scoring algorithm itself
A ratio-based N is strictly better than a fixed count for variable-length inputs

Future Improvements

Implement TextRank and compare ROUGE scores against frequency scoring on a held-out test corpus
Add a web interface for paste-and-summarise use cases
Experiment with transformer-based abstractive summarisation as a comparison baseline

What I'd Do Differently

Build a ROUGE evaluation harness at the start; measuring quality objectively from the beginning changes what improvements you prioritise
Collect a structured test corpus before writing code, not after
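As a starting point for such a harness, ROUGE-1 can be approximated with plain unigram overlap. The sketch below is a simplified illustration, not the official ROUGE toolkit: it uses a basic regex tokeniser, applies no stemming or stop word handling, and the `rouge1` name is hypothetical.

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram-overlap precision, recall and F1 between a generated
    summary (candidate) and a human-written one (reference)."""
    cand = Counter(_tokens(candidate))
    ref = Counter(_tokens(reference))
    # Counter intersection keeps the minimum count per shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Averaging these scores over a test corpus with reference summaries would give a single number to compare scoring variants against.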