How RegEx pairs help me to prepare WordPress content for translation

One of my regular tasks is to translate news items (WordPress posts) on my employer’s website. If contributions contain ABBR and LANG tags for accessibility, I usually need to remove them before I translate. The same approach naturally also works for WordPress pages. Content generated in Word might also contain other hidden tags inserted in the text. In this blog post, the example is a tag pair that formats phone numbers so that a VOIP phone system can dial them.

Removing ABBR tags

For accessibility, abbreviations are spelled out using an ABBR tag pair like the following.

[abbr title="Bankwesengesetz"]BWG[/abbr]

In this specific case, in displayed posts, the abbreviation’s full form (the text inside title="" in the opening tag) appears in a tooltip when you hover over the abbreviation enclosed by the tag pair.

I’ve recorded a macro in Notepad++ that runs a search and replace for each of the following two regular expressions, with the search mode set to "Regular expression". In both cases the replace field is empty. Because it is a macro, I can run it with a single keyboard shortcut, which saves a lot of time.

\[abbr.*?\]
\[\/abbr\]
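
The same two replacements can also be run outside Notepad++. The following is a minimal sketch in Python, assuming the post text is available as a string. The function name strip_abbr and the sample sentence are invented for illustration; the two patterns are the ones shown above.

```python
import re

# The two patterns from the Notepad++ macro; the replacement is empty in both cases.
ABBR_OPEN = re.compile(r"\[abbr.*?\]")
ABBR_CLOSE = re.compile(r"\[\/abbr\]")


def strip_abbr(text: str) -> str:
    """Remove [abbr ...] and [/abbr] tags but keep the abbreviation itself."""
    text = ABBR_OPEN.sub("", text)
    return ABBR_CLOSE.sub("", text)


# Sample sentence made up for this example.
sample = 'Section 38 [abbr title="Bankwesengesetz"]BWG[/abbr] applies.'
print(strip_abbr(sample))
```

The non-greedy .*? matters here: it stops at the first closing bracket, so two tagged abbreviations on the same line are handled separately instead of being swallowed in one match.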

After translation in Trados, I usually add the necessary abbreviations to the English version, working from a file I have saved in Notepad++, and then copy and paste the full file into the code view of the post in WordPress.

Removing LANG tags

LANG tags ensure that screen readers read words, phrases or sentences in a language other than the page language with the correct pronunciation.

For example, take the following sentence:

Article 38 of the Bankwesengesetz addresses banking secrecy requirements, commonly referred to in Austria as Bankgeheimnis.

A sample sentence showing an English sentence containing some German words.

The code view will show:

Article 38 of the [lang title="DE"]Bankwesengesetz[/lang] addresses banking secrecy requirements, commonly referred to in Austria as [lang title="DE"]Bankgeheimnis[/lang].

To remove these tags, I run a search and replace for each of the following two patterns. The first matches the opening tag before the words, phrases or sentences that a screen reader should read in another language. The second matches the closing tag of the pair.

\[lang.*?\]
\[\/lang\]
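
As a variation, the two searches can be folded into one by making the slash optional. This is a minimal sketch in Python rather than the Notepad++ macro itself; the combined pattern and the name strip_lang are assumptions made for illustration.

```python
import re

# One pattern for both tags: "/?" makes the slash optional,
# so it matches [lang title="DE"] as well as [/lang].
LANG_TAG = re.compile(r"\[/?lang.*?\]")


def strip_lang(text: str) -> str:
    """Remove opening and closing LANG tags, keeping the enclosed words."""
    return LANG_TAG.sub("", text)


tagged = (
    'Article 38 of the [lang title="DE"]Bankwesengesetz[/lang] addresses '
    "banking secrecy requirements, commonly referred to in Austria as "
    '[lang title="DE"]Bankgeheimnis[/lang].'
)
print(strip_lang(tagged))
```

Run on the code-view sentence above, this gives back the plain sample sentence with both German words intact.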

Removing proprietary tag pairs

This example removes the tags inserted to turn a telephone number, e.g. in a mail signature, into a number that can be dialled directly. The tag pair may be visible in the code view of the post text, typically in the contact details of a media spokesperson in a press release. Its purpose is to allow a VOIP telephony system to dial the phone number. This may not work correctly, so it makes sense to remove the tag pair from the source code.

To do that, I use the following pair of entries in the search/replace function of Notepad++.

<avaya.*?>
</avaya.*?>
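
The sketch below applies both patterns in sequence in Python, using a small helper that accepts any list of search patterns. The helper name strip_patterns is invented, and so is the sample markup: the attribute shown inside the tag is purely hypothetical, as the real tags may look different.

```python
import re

# The search patterns from above, applied one after the other.
AVAYA_PATTERNS = [r"<avaya.*?>", r"</avaya.*?>"]


def strip_patterns(text: str, patterns: list[str]) -> str:
    """Replace every match of each pattern with nothing (an empty replace field)."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


# Hypothetical markup: the attribute name and value are invented for this example.
signature = 'Tel.: <avaya number="+4312345678">+43 1 234 56 78</avaya>'
print(strip_patterns(signature, AVAYA_PATTERNS))
```

Because the helper takes the patterns as a list, the ABBR and LANG pairs could be passed in the same way.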

Further uses

There are endless further uses beyond the cases above. One that I use quite often is removing SPAN tags that appear in a post or page when text is copied and pasted from MS Word. Typically, this happens where someone has used the format painter, thereby creating some tag soup in the source text.

Why do I do this? SPAN tags can bloat the post/page code unnecessarily. This can be disruptive when translating the text of a page/post.
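
No pattern for the SPAN case is given in this post, so the following Python sketch rests on an assumed one, modelled on the tag pairs above. The pattern, the function name strip_spans and the sample markup are all illustrative. The sketch also reports how many tags were removed, which gives a rough idea of how much bloat a pasted text contained.

```python
import re

# Assumed pattern, not taken from this post:
# "/?" covers opening and closing tags, "\b" stops matches on longer tag names,
# and "[^>]*" also spans attributes that wrap onto a new line.
SPAN_TAG = re.compile(r"</?span\b[^>]*>")


def strip_spans(html: str) -> tuple[str, int]:
    """Return the text without SPAN tags and the number of tags removed."""
    return SPAN_TAG.subn("", html)


# Sample markup invented for this example.
pasted = '<p><span style="font-weight: 400;">Press </span><span lang="EN-GB">release</span></p>'
clean, removed = strip_spans(pasted)
print(clean, removed)
```

Note that this removes every SPAN tag, including any that carry meaningful attributes, so it is worth checking the result before pasting it back into WordPress.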
