How RegEx pairs help me to prepare WordPress content for translation

Posted on 19 December 2023 by t9natno5

One of my regular tasks is to translate news items (WordPress posts) on my employer’s website. If contributions contain ABBR and LANG tags for accessibility, I usually need to remove them before I translate. The same approaches naturally also work for pages in WordPress. Some content generated in Word might also contain other hidden tags inserted in the text. The ones covered in this blog post format phone numbers so that they can be dialled from a VOIP phone system.

Removing ABBR Tags

Accessibility requires spelling out abbreviations using ABBR tag pairs like the following.

[abbr title="Bankwesengesetz"]BWG[/abbr]

In this specific case, in displayed posts, the abbreviation’s full form (the text inside title=" " in the opening tag) appears in a tooltip when hovering over the abbreviation enclosed by the tag pair.

I’ve recorded a macro in Notepad++ that does a search and replace for the following RegEx pair. In both cases the replace field is empty. Because it is a macro, I can run it with a single keyboard shortcut, which can save a lot of time.

\[abbr.*?\]
\[\/abbr\]
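
For anyone who prefers a script to a Notepad++ macro, here is a minimal Python sketch of the same two-step search and replace. It uses the two expressions above unchanged; the function name and the sample sentence are only illustrative assumptions, not part of my actual workflow.

```python
import re

# The two search expressions from the post. The replacement is an empty
# string, mirroring the empty "replace" field in Notepad++.
OPENING_ABBR = re.compile(r"\[abbr.*?\]")
CLOSING_ABBR = re.compile(r"\[\/abbr\]")


def strip_abbr_tags(text: str) -> str:
    """Remove [abbr ...] and [/abbr] tags but keep the abbreviation itself."""
    text = OPENING_ABBR.sub("", text)
    return CLOSING_ABBR.sub("", text)


# Illustrative sample sentence (an assumption, not real post content).
sample = 'Section 38 of the [abbr title="Bankwesengesetz"]BWG[/abbr] applies.'
print(strip_abbr_tags(sample))
```

The lazy quantifier `.*?` makes the first expression stop at the first closing bracket, so only the opening tag is removed and the abbreviation stays in place.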

After translating in Trados, I usually add the necessary abbreviations to the English version from a file I have saved in Notepad++, and then copy and paste the full file into the code view of the post in WordPress.

Removing LANG tags

LANG tags tell screen readers that words/phrases/sentences are in a language other than the page language, so that they are read out in that language.

For example, take the following sentence:

Article 38 of the Bankwesengesetz addresses banking secrecy requirements, commonly referred to in Austria as Bankgeheimnis.
A sample sentence showing an English sentence containing some German words.

The code view will show

Article 38 of the [lang title="DE"]Bankwesengesetz[/lang] addresses banking secrecy requirements, commonly referred to in Austria as [lang title="DE"]Bankgeheimnis[/lang].

To remove these tags, I perform a search and replace with the following two expressions. The first one matches the opening tag before the words/phrases/sentences to be read by a screen reader in another language. The second one matches the closing tag in the pair.

\[lang.*?\]
\[\/lang\]
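
As a rough illustration, the Python sketch below applies these two expressions to the sample sentence from the code view. It also shows why the lazy `.*?` matters when a line contains more than one tag pair. The function name and the greedy comparison are my own additions for demonstration purposes.

```python
import re

# The sample sentence as it appears in the code view of the post.
sentence = (
    'Article 38 of the [lang title="DE"]Bankwesengesetz[/lang] addresses banking '
    'secrecy requirements, commonly referred to in Austria as '
    '[lang title="DE"]Bankgeheimnis[/lang].'
)


def strip_lang_tags(text: str) -> str:
    """Remove [lang ...] and [/lang] tags, keeping the enclosed words."""
    text = re.sub(r"\[lang.*?\]", "", text)
    return re.sub(r"\[\/lang\]", "", text)


cleaned = strip_lang_tags(sentence)
print(cleaned)

# For comparison only: a greedy ".*" runs from the first "[lang" to the
# last "]" on the line and deletes the text in between as well.
greedy = re.sub(r"\[lang.*\]", "", sentence)
print(greedy)
```

With the lazy version, the sentence comes out exactly as it is displayed to readers. The greedy variant leaves only "Article 38 of the ." behind, which is why the question mark in the expression is worth keeping.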

Removing proprietary tag pairs

This example removes the tags inserted to turn a telephone number, e.g. in a mail signature, into a number that can be dialled directly. The tag pair may be visible in the code view of the post text. Typically this is the case for the contact details of a media spokesperson in a press release. The tag pair’s purpose in this case is to allow a VOIP telephony system to dial the phone number. This may not work correctly, so it makes sense to remove the tag pair from the source code.

To do that, I use the following pair of entries in the search/replace function of Notepad++.

<avaya.*?>
</avaya.*?>
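
The same pair can be run from a script. The sketch below is a minimal Python version that applies a list of search expressions one after the other, much like a recorded macro would. Note that the sample markup is hypothetical: the real tag and attribute names inserted by the telephony system may look different, and the phone number is made up.

```python
import re

# Search expressions from the post; the replacement is always empty.
AVAYA_PATTERNS = [r"<avaya.*?>", r"</avaya.*?>"]


def remove_patterns(text: str, patterns: list[str]) -> str:
    """Apply each search expression in turn, replacing matches with nothing."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


# Hypothetical markup for illustration only (tag name, attribute and
# phone number are assumptions, not taken from a real press release).
sample = 'Press office: <avaya-dial number="+4312345678">+43 1 234 5678</avaya-dial>'
print(remove_patterns(sample, AVAYA_PATTERNS))
```

Because the helper takes any list of expressions, the ABBR and LANG pairs could be passed to it in the same way.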

Further uses

There are endless further uses beyond the cases above. One that I use quite often is to remove SPAN tags that appear in a post or page when text is copy-pasted out of MS Word. Typically, this happens where someone has used the format painter, thereby creating some tag soup in the source text.

Why do I do this? SPAN tags can bloat the post/page code unnecessarily. This can prove disruptive for translating the text of a page/post.
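
The post does not list the expressions for SPAN tags, so the following Python sketch is an assumption modelled on the tag pairs above. It uses `[^>]*` instead of `.*?` so that an opening tag spread over several lines is still matched, and it removes every SPAN tag, including any that carry formatting someone may want to keep.

```python
import re

# Assumed expressions, modelled on the earlier tag pairs; they are not
# taken from the post. \b keeps "<span" from matching longer tag names.
SPAN_OPEN = re.compile(r"<span\b[^>]*>", re.IGNORECASE)
SPAN_CLOSE = re.compile(r"</span>", re.IGNORECASE)


def strip_span_tags(html: str) -> str:
    """Remove opening and closing SPAN tags, keeping the text between them."""
    return SPAN_CLOSE.sub("", SPAN_OPEN.sub("", html))


# Illustrative tag soup of the kind pasted in from Word (made-up sample).
sample = '<p><span style="font-size: 11pt;">Press </span><span lang="EN-GB">release</span></p>'
print(strip_span_tags(sample))
```

For the sample above, only the paragraph tags and the text "Press release" remain, which is much easier to handle in a translation tool.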


2 thoughts on “How RegEx pairs help me to prepare WordPress content for translation”

L says:

19 December 2023 at 22:35

Thanks for your blog post.
I share your enthusiasm for regular expressions.

On a similar note, I would like to add that Trados also offers some advanced possibilities when it comes to regular expressions:
The versatile regex based text filter in Trados Studio…

(I think there are similar options for other CAT tools.)

t9natno5 says:
  
  21 December 2023 at 21:57
  
  Hi L! Good to hear from you – we met in Bonn in 2019 – that seems a long time ago! Paul Filkin’s blog has some great RegEx posts and one use I have is in the QA tool – I learned about it from a speaker from the Commission at ETUG 2023 – a really entertaining presentation and I took away a lot from it. Michael