regex - Removing all non-letter chars from a string with accents in Python

Question


asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to delete all non-letter chars (except white-space) from a string containing accents using Python 3.7. I tried the following:

import re

text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
clean_text = re.sub(r'[\W_\d]+', ' ', text)
print(clean_text)

The output is

Андре й Серге евич Арша вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

Why do I get a whitespace after the accented char in my result string? This seems to violate the principle of least surprise. So I tried a different solution

text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
clean_text2 = "".join(c for c in text if c.isalpha() or c == " ")
print(clean_text2)

The output is

Андрей Сергеевич Аршавин род  мая  Ленинград  российский футболист бывший капитан сборной России заслуженный мастер спорта России

This is nearly what I wanted, except that it removes the accents from the chars. I would like to have the following result:

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

Is there a way to remove all non-letter chars from a string, but keep the accents on the chars?


1 Answer

深蓝 · Answer 1 · 2021-10-23T21:42:59+0000

Basic solution for Russian word stress symbols

Russian letters do not have accents. The accent mark in your string indicates word stress and is used only in specific kinds of text, such as textbooks for foreigners or books for children.

Here, е́ is the letter е followed by the \u0301 character, U+0301 COMBINING ACUTE ACCENT. You can subtract just this one diacritic from your pattern to get the result you want:

clean_text = re.sub(r'(?:(?!\u0301)[\W\d_])+', ' ', text)

See the Python demo, which yields

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

See the regex demo online.
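To see why the original pattern leaves a space behind, it can help to check how Python classifies U+0301. The following is a minimal sketch using only the standard library; the names `ACUTE`, `word`, and `kept` and the shortened sample string are illustrative assumptions, not part of the original question.

```python
import re
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# The stress mark is a nonspacing mark (category Mn), not a letter,
# so \W matches it and str.isalpha() rejects it.
print(unicodedata.category(ACUTE))        # Mn
print(ACUTE.isalpha())                    # False
print(bool(re.fullmatch(r"\W", ACUTE)))   # True

# Shortened, hypothetical sample with one stressed vowel.
word = "Андре" + ACUTE + "й (1981)"

# Original pattern: the accent is replaced along with the other non-letters.
print(re.sub(r"[\W_\d]+", " ", word))

# The negative lookahead excludes U+0301 from the match, so the accent stays.
kept = re.sub(r"(?:(?!\u0301)[\W\d_])+", " ", word)
print(kept)
```

The lookahead only protects this single code point; any other combining mark would still be replaced.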

Solution supporting all diacritics - PyPI regex module

To keep all diacritic marks, the easiest way is to install the PyPI regex module (with pip install regex) and then use the \p{L} and \p{M} Unicode property classes:

import regex

text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
# Replace every run of characters that are neither letters nor marks with a space
print(regex.sub(r'[^\p{L}\p{M}]+', ' ', text))
# => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России
# Or extract each chunk of letters with their marks and join the chunks with spaces
print(" ".join(regex.findall(r'(?>\p{L}\p{M}*+)+', text)))
# => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

Here, the [^\p{L}\p{M}]+ regex matches one or more characters other than Unicode letters (\p{L}) and diacritic marks (\p{M}). The other solution, (?>\p{L}\p{M}*+)+ with regex.findall, extracts all chunks of letters with their diacritics from the text, and " ".join(...) then concatenates them with a space.
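If installing a third-party module is not an option, roughly the same filtering can be sketched with the standard unicodedata module. This is only an approximation of the [^\p{L}\p{M}]+ substitution, assuming that general categories starting with L and M correspond to \p{L} and \p{M}; `keep_letters_and_marks` is a hypothetical helper name, and the sample string is shortened.

```python
import unicodedata


def keep_letters_and_marks(text: str) -> str:
    """Replace each run of non-letter, non-mark characters with one space."""
    out = []
    in_gap = False  # True while we are inside a run of removed characters
    for ch in text:
        # Categories L* are letters, M* are combining marks.
        if unicodedata.category(ch)[0] in "LM":
            out.append(ch)
            in_gap = False
        elif not in_gap:
            out.append(" ")
            in_gap = True
    return "".join(out)


sample = "Арша\u0301вин (род. 29 мая 1981[4], Ленинград)"
result = keep_letters_and_marks(sample)
print(result)
```

Like the substitution it imitates, this leaves a trailing space when the text ends with a removed character; call `.strip()` on the result if that matters.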

Diacritics support with Python re

You will need to "spell out" the \p{M} class yourself, and you can match any Unicode letter with the [^\W\d_] construct. It makes sense to use the find-all-words-and-then-concatenate approach here rather than re.sub:

import re
# Combining marks in the Basic Multilingual Plane, written as the body of a character class
combining_marks_bmp = '\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D\u0859-\u085B\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0C00-\u0C03\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0D01-\u0D03\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EB9\u0EBB\u0EBC\u0EC8-\u0ECD\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD\u180B-\u180D\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF2-\u1CF4\u1CF8\u1CF9\u1DC0-\u1DF5\u1DFC-\u1DFF\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881\uA8B4-\uA8C4\uA8E0-\uA8F1\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F'
# Astral-plane combining marks written as surrogate pairs. Note: a Python 3 str holds an astral
# character as a single code point, so this alternative only matches literal surrogate code units.
combining_marks_astral = '\uD805[\uDCB0-\uDCC3\uDDAF-\uDDB5\uDDB8-\uDDC0\uDDDC\uDDDD\uDE30-\uDE40\uDEAB-\uDEB7\uDF1D-\uDF2B]|\uD834[\uDD65-\uDD69\uDD6D-\uDD72\uDD7B-\uDD82\uDD85-\uDD8B\uDDAA-\uDDAD\uDE42-\uDE44]|\uD804[\uDC00-\uDC02\uDC38-\uDC46\uDC7F-\uDC82\uDCB0-\uDCBA\uDD00-\uDD02\uDD27-\uDD34\uDD73\uDD80-\uDD82\uDDB3-\uDDC0\uDDCA-\uDDCC\uDE2C-\uDE37\uDEDF-\uDEEA\uDF00-\uDF03\uDF3C\uDF3E-\uDF44\uDF47\uDF48\uDF4B-\uDF4D\uDF57\uDF62\uDF63\uDF66-\uDF6C\uDF70-\uDF74]|\uD81B[\uDF51-\uDF7E\uDF8F-\uDF92]|\uD81A[\uDEF0-\uDEF4\uDF30-\uDF36]|\uD82F[\uDC9D\uDC9E]|\uD800[\uDDFD\uDEE0\uDF76-\uDF7A]|\uD836[\uDE00-\uDE36\uDE3B-\uDE6C\uDE75\uDE84\uDE9B-\uDE9F\uDEA1-\uDEAF]|\uD802[\uDE01-\uDE03\uDE05\uDE06\uDE0C-\uDE0F\uDE38-\uDE3A\uDE3F\uDEE5\uDEE6]|\uD83A[\uDCD0-\uDCD6]|\uDB40[\uDD00-\uDDEF]'
letter = r'[^\W\d_]'  # any word character except digits and the underscore
# A word is a run of letters and combining marks; `text` is the sample string defined earlier
pat = re.compile(r'(?:{}|[{}]|{})+'.format(letter, combining_marks_bmp, combining_marks_astral))
print(" ".join(pat.findall(text)))
# => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

See the online Python demo
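As an alternative to spelling the ranges out by hand, the character class can be generated from the Unicode database that ships with Python. This is a sketch under assumptions, not the answer's own method: `mark_class_body` and `word_pat` are hypothetical names, and the set of marks depends on the Unicode version of the running interpreter. Because it works on code points, it also covers marks outside the Basic Multilingual Plane.

```python
import re
import sys
import unicodedata


def mark_class_body() -> str:
    """Return character-class ranges covering every code point in a Mark (M*) category."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith("M"):
            if start is None:
                start = cp  # open a new range
            prev = cp
        elif start is not None:
            ranges.append((start, prev))  # close the current range
            start = None
    if start is not None:
        ranges.append((start, prev))
    # \UXXXXXXXX escapes are understood by the re module in str patterns.
    return "".join(
        "\\U{:08X}".format(a) if a == b else "\\U{:08X}-\\U{:08X}".format(a, b)
        for a, b in ranges
    )


# A word is a run of letters ([^\W\d_]) and combining marks.
word_pat = re.compile(r"(?:[^\W\d_]|[{}])+".format(mark_class_body()))

sample = "Андре\u0301й Арша\u0301вин (2008)."
words = word_pat.findall(sample)
print(" ".join(words))
```

Building the class scans the whole code point range once, so it is worth compiling the pattern a single time and reusing it.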
