Fuzzy-match football team names across data feeds. Try the open-source algorithm we built and use in production at Scorecast — handles abbreviations, accents, alternate languages, swapped home/away, and inconsistent league names.
The algorithm is open-source under the MIT license: pure Python, zero dependencies, fully typed.
Anyone who has joined data from two different sports providers has hit the wall: "Man Utd" vs "Manchester United FC", "Real Madrid CF" vs "Real Madrid", "Bayern München" vs "FC Bayern Munich". Naive equality fails. Off-the-shelf string distance (e.g. difflib.SequenceMatcher) is fragile: it scores "Manchester United" vs "Manchester City" at over 80%, which is dangerously high for two different clubs.
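The character-level problem is easy to reproduce with the standard library alone. The snippet below is a small illustration, not part of the package; the `ratio` helper is a name introduced here for convenience.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


# Two different clubs that share a long prefix.
different_clubs = ratio("Manchester United", "Manchester City")
# The same club written two ways.
same_club = ratio("Man Utd", "Manchester United FC")

print(round(different_clubs, 4))    # 0.8125
print(different_clubs > same_club)  # True: the wrong pair scores higher
```

The shared "Manchester " prefix dominates the score, which is why a token-based comparison is a better fit for team names.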
Normalise each name first: strip accents, drop parentheticals such as (W) and (II), drop age tags such as U21, then split on whitespace and punctuation.
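A minimal sketch of that normalisation step, using only the standard library. The function name `tokenize` and the exact regular expressions are assumptions made for illustration; the package's own rules may differ.

```python
import re
import unicodedata

# Hypothetical patterns; the published package may use different rules.
PARENTHETICAL = re.compile(r"\([^)]*\)")  # "(W)", "(II)"
AGE_TAG = re.compile(r"\bu\d{2}\b")       # "u21", "u19" (after lowercasing)
SEPARATORS = re.compile(r"[^a-z0-9]+")    # whitespace and punctuation


def tokenize(name: str) -> list[str]:
    # NFKD splits "ü" into "u" plus a combining mark; dropping the
    # combining marks is what strips the accent.
    decomposed = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.lower()
    text = PARENTHETICAL.sub(" ", text)
    text = AGE_TAG.sub(" ", text)
    return [tok for tok in SEPARATORS.split(text) if tok]


print(tokenize("Bayern München (II)"))  # ['bayern', 'munchen']
print(tokenize("Chelsea U21 (W)"))      # ['chelsea']
```

Letters that have no Unicode decomposition (for example "ø") are not handled by this sketch and would need an explicit mapping.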
From the resulting tokens, drop generic designators (fc, sc, cf, real, atletico) and language particles. Keep distinguishing words such as united and city.
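The filtering step might look like the sketch below. The designator set comes from the list above; the particle set and the fall-back for names made up entirely of generic words are assumptions added for illustration, not the package's documented behaviour.

```python
# Designators named in the text above.
GENERIC_DESIGNATORS = {"fc", "sc", "cf", "real", "atletico"}
# Assumed examples of language particles; the real list is not given here.
LANGUAGE_PARTICLES = {"de", "del", "la", "el", "the"}


def significant_tokens(tokens: list[str]) -> set[str]:
    stop_words = GENERIC_DESIGNATORS | LANGUAGE_PARTICLES
    kept = {tok for tok in tokens if tok not in stop_words}
    # Assumption: if every token was generic, keep them all rather than
    # comparing an empty set.
    return kept or set(tokens)


print(sorted(significant_tokens(["manchester", "united", "fc"])))  # ['manchester', 'united']
print(sorted(significant_tokens(["real", "madrid", "cf"])))        # ['madrid']
```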
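The similarity of two token sets is sim = 0.4 × jaccard + 0.6 × containment. Jaccard is the size of the intersection divided by the size of the union. Containment normalises the intersection by the smaller set, so Olancho matches Olancho FC at 1.0.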
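The formula can be sketched directly over Python sets. The weights are the ones stated above; the function names and the handling of empty sets are assumptions.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set[str], b: set[str]) -> float:
    """Shared tokens divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def similarity(a: set[str], b: set[str]) -> float:
    return 0.4 * jaccard(a, b) + 0.6 * containment(a, b)


# Identical token sets score 1.0.
print(round(similarity({"olancho"}, {"olancho"}), 3))                            # 1.0
# A short name fully contained in a longer one still scores high.
print(round(similarity({"bayern"}, {"bayern", "munich"}), 3))                    # 0.8
# One shared token out of two on each side scores low.
print(round(similarity({"manchester", "united"}, {"manchester", "city"}), 3))    # 0.433
```

Weighting containment more heavily than Jaccard means a feed that omits a word is penalised less than a feed that uses a different word.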
League names are wildly inconsistent across feeds, so kickoff time is the more reliable corroborating signal. If both fixtures have kickoffs within 30 min of each other, add up to +0.20 to the score. This single rule typically raises cross-feed match rates from ~10% to over 65%.
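One way to implement the kickoff rule is sketched below. The 30-minute window and the +0.20 cap are taken from the text; the linear decay inside the window is an assumption, since the text only says "up to", and `kickoff_bonus` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_BONUS = 0.20                 # cap stated in the text
WINDOW = timedelta(minutes=30)   # window stated in the text


def kickoff_bonus(a: Optional[datetime], b: Optional[datetime]) -> float:
    # Both datetimes are assumed to be in the same timezone (or both naive).
    if a is None or b is None:
        return 0.0
    gap = abs(a - b)
    if gap > WINDOW:
        return 0.0
    # Assumed shape: full bonus for identical kickoffs, fading linearly
    # to zero at the edge of the window.
    return MAX_BONUS * (1 - gap / WINDOW)


kickoff = datetime(2026, 4, 27, 19, 45)
print(round(kickoff_bonus(kickoff, kickoff), 2))                          # 0.2
print(round(kickoff_bonus(kickoff, kickoff + timedelta(minutes=15)), 2))  # 0.1
print(round(kickoff_bonus(kickoff, kickoff + timedelta(minutes=45)), 2))  # 0.0
```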
The algorithm above is published as an MIT-licensed Python package with zero dependencies and a full test suite. Drop it into any data pipeline.
pip install team-matcher
from datetime import datetime
from team_matcher import Candidate, match_fixture
candidates = [
    Candidate(
        home="Manchester United FC",
        away="Liverpool FC",
        league="Premier League",
        kickoff=datetime(2026, 4, 27, 19, 45),
        payload="match_id_123",
    ),
]
m = match_fixture("Man Utd", "Liverpool", "EPL",
                  candidates, kickoff=datetime(2026, 4, 27, 19, 45))
print(m.score, m.candidate.payload)  # 1.0 match_id_123