  • The ten-step Translation Memory maintenance bootcamp

    In the second quarter of the year, a slew of public holidays in Austria (post-Easter observances) takes over the calendar. The six weeks after Easter can either be a very stop-start period, with potential days for recharging your batteries, or the perfect time to spring-clean your language data resources. Usually at least two of these public holidays come with a Fenstertag (also known as a Brückentag or a Zwickeltag), i.e. a working day between two non-working days. If you have school-age children, their schools might also take a schulautonomer freier Tag (an extra day off set by each school). That is either an opportunity to take a minibreak with the family or a reason to curse that some children are going to school while others are not.

    I often volunteer to work on one of the Fenstertage, as it allows other members of my division to have a day off. In return, I have a quiet and near-empty office for productive language data management days. Last month, had it not been for a fire alarm and building evacuation just before lunchtime, I would barely have left my office all day. As it was, the fire alarm and subsequent lunch allowed me to chat with colleagues I hadn’t seen in a while – another win!

    This post covers the range of translation memory database-related activities I have carried out, both on Fenstertage and on a more ad hoc basis. Some of this work is also particularly relevant ahead of upgrading to a new version of Trados Studio and MultiTerm. But that wasn’t the only reason to have a spring clean.

    Translation Memory-related maintenance activities

    I work with a number of file-based translation memories. I back them up regularly, but until now I have not usually carried out any maintenance routines. This is particularly the case for the read-only translation memories created from “quick and dirty” bilingual alignments. By quick and dirty, I mean that they were only given a cursory look-over in either Excel or Notepad++ as part of the alignment. As a result, they initially contained very long pipe-separated segments that were only useful for concordance. As my proficiency in RegEx has improved, I have set up a number of macros in Notepad++ to break them up into more manageable TUs.
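
    As an illustration of the kind of splitting those macros perform, here is a minimal Python sketch. It is not the Notepad++ macro itself: the function name `split_aligned_segment` is hypothetical, and it assumes that both sides of an aligned pair use a pipe as the separator and break into the same number of parts.

```python
import re

# Assumption: parts are separated by a pipe with optional whitespace around it.
PIPE = re.compile(r"\s*\|\s*")


def split_aligned_segment(source: str, target: str) -> list[tuple[str, str]]:
    """Split one long aligned pair into smaller translation units.

    If the two sides do not break into the same number of parts, the
    original pair is returned unchanged so nothing is silently mis-aligned.
    """
    src_parts = PIPE.split(source.strip())
    tgt_parts = PIPE.split(target.strip())
    if len(src_parts) != len(tgt_parts):
        return [(source, target)]
    # Drop pairs where either side is empty (e.g. a trailing pipe).
    return [(s, t) for s, t in zip(src_parts, tgt_parts) if s and t]


if __name__ == "__main__":
    for pair in split_aligned_segment("Eigenmittel | Liquidität",
                                      "Own funds | Liquidity"):
        print(pair)
```

    Pairs that cannot be split cleanly are left as they are, so they can still be checked by hand.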

    I took the time to carry out in-depth maintenance and change my approach because this was in danger of becoming a translation memory lifecycle management problem. TMs that are allowed to grow unchecked result in slower and longer backups. They also bring declining performance and a rising risk of outdated content swamping results. Through my static, Outlook reminder-led backup process, I saw how file sizes were increasing. At the same time, a growing number of translation units were sitting unused due to changes in my work for banking supervision.

    It was with some relief that I saw that this problem was not unique to me. Munich Re’s Language Technology and Data Unit also mentioned this issue at a recent ETUG Bitesize webinar. In their case, they realised that it was necessary to take inventory ahead of moving everything to the cloud. They found that they had over 100 TMs, including numerous duplicates, and rationalised this number substantially. In their implementation, backing up only “Delta TMs” (i.e. ones that had changed since the last backup) proved useful.

    My ten-step plan for my TMs

    Breaking down big TMs.

    1. Splitting TMs based on their recency and value: I have chosen to take a three-tier approach. The top tier (Tier 1) is for the most recently used TUs and TMs. It comprises TUs created or last used within the last 3 years. Tier 2 serves as reference only, i.e. for concordance searches, and holds TUs created or last used 3 to 7 years ago. Tier 3 is for all units that were created and last used over 7 years ago. My new Tier 1 and 2 TMs are retained in project settings. Tier 3 is something I actively have to choose to load into an individual project (and it would also carry penalties). In preparation, filter each TM to find the number of TUs in scope for each age band. The results might reveal how much of your TMs is historic and unused.
    2. Introducing a moving window policy: only TUs that have been created and/or last used in the three years prior to the cut-off date remain in Tier 1 TMs. Once TUs are older than that, they are exported from the Tier 1 TM, imported into the Tier 2 TM, and deleted from the Tier 1 TM. The same happens to TUs in the Tier 2 TM, which go into the Tier 3 TM after seven years. The move should be based on the LastUsed date and, as a fallback (i.e. for entries that were created and never used since), the CreatedDate. I apply the moving window on a 60-day cycle. This also helps absorb the “quiet periods”, e.g. summers where I may have been away for up to one month. Filtering for TUs whose LastUsed and CreatedDate are over five years ago can also show whether lots of units are affected or only a few (and you can then consider whether to change the moving window frequency).
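
    The tier boundaries and the LastUsed/CreatedDate fallback described above can be sketched in a few lines of Python. This is only an illustration of the rule, not something Trados Studio provides: the names `assign_tier` and `years_before` are hypothetical, and the dates are assumed to have already been read from the TM.

```python
from collections import Counter
from datetime import date
from typing import Optional

TIER1_YEARS = 3  # created or last used within the last 3 years
TIER2_YEARS = 7  # reference only; anything older goes to Tier 3


def years_before(day: date, years: int) -> date:
    """Return the same calendar day `years` earlier (29 Feb falls back to 28 Feb)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def assign_tier(created: date, last_used: Optional[date], cutoff: date) -> int:
    """Pick a tier from the last-used date, falling back to the creation date."""
    reference = last_used or created
    if reference >= years_before(cutoff, TIER1_YEARS):
        return 1
    if reference >= years_before(cutoff, TIER2_YEARS):
        return 2
    return 3


if __name__ == "__main__":
    cutoff = date(2025, 6, 1)  # hypothetical cut-off date
    units = [
        (date(2015, 1, 1), date(2024, 1, 1)),  # old, but used recently
        (date(2020, 1, 1), None),              # never used since creation
        (date(2010, 1, 1), None),
    ]
    print(Counter(assign_tier(c, u, cutoff) for c, u in units))
```

    Counting the units per tier in this way mirrors the preparatory filtering in step 1 and shows how many TUs each move would affect.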

    Split ’em up!

    1. Demote rather than delete: instead of just deleting the TUs from the Tier 1 TM, export them to the Tier 2 TM. Instead of maintaining a Tier 3 TM, you could export the old TUs from Tier 2 as a TMX file that you retain. This latter approach might also be useful in terms of auditability, traceability, or the ability to recover specific legacy terminology.
    2. Splitting by content type: I looked through a lot of the jobs I did and noticed that, back in the early days, I often had too many content types in a single TM: everything went into either a single banking supervision TM or a “non-banking supervision TM”. Fortunately, with the moving window policy, the unlabelled TUs are now only in the Tier 2 TMs. About five years ago, I introduced a metadata field for “Supervisory Area”, which has meant that I can use filters and penalties to get better control over my TMs. I have one TM for translations of laws and regulations, which is used in read-only mode only. Its metadata also records which amendment each TU comes from, which lets me ensure that outdated provisions go into the corresponding Tier 2 TM. I also keep “communications” texts from our press/communications team and financial literacy texts (which are more freely translated) in a separate TM.
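
    For the “demote rather than delete” step, the same idea can be tried out on a TMX export outside the CAT tool. The sketch below is an assumption-laden illustration rather than the workflow used in Trados Studio: `demote_old_units` is a hypothetical helper, and it assumes the export carries the TMX `lastusagedate` and `creationdate` attributes on each `<tu>` in the usual `YYYYMMDDThhmmssZ` form, so check this against your own files first.

```python
import copy
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Optional

TMX_STAMP = "%Y%m%dT%H%M%SZ"  # assumed timestamp format of the export


def tu_date(tu: ET.Element) -> Optional[datetime]:
    """Last-used date of a TU, falling back to its creation date."""
    stamp = tu.get("lastusagedate") or tu.get("creationdate")
    if stamp is None:
        return None
    return datetime.strptime(stamp, TMX_STAMP)


def demote_old_units(tmx_root: ET.Element, cutoff: datetime) -> ET.Element:
    """Move TUs older than `cutoff` out of `tmx_root` into a new TMX tree."""
    archive = ET.Element("tmx", dict(tmx_root.attrib))
    header = tmx_root.find("header")
    if header is not None:
        archive.append(copy.deepcopy(header))
    archive_body = ET.SubElement(archive, "body")
    body = tmx_root.find("body")
    if body is None:
        return archive
    for tu in list(body.findall("tu")):
        when = tu_date(tu)
        # Undated TUs stay where they are and can be reviewed by hand.
        if when is not None and when < cutoff:
            body.remove(tu)
            archive_body.append(tu)
    return archive
```

    Both trees can then be written out with `ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)`: the archive tree is the file to import into the lower tier or simply retain, and nothing is lost from the original until the archive has been saved.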

    Quality Steps

    1. Deduplication and quality filter checks: the move to the new hierarchy is also a very good opportunity to run some quality checks, such as removing TUs that are very short, purely numeric, contain only URLs, or have identical source and target. Often there might be several target versions of the same source text – this can sometimes happen if the source uses a pronoun to talk about a type of entity and the target uses a noun. Doing this now will help with future backups.
    2. Changing project settings: with the move to a tiered approach, I have revised the settings of all my projects to take the new hierarchy into account. It is essential to do this as soon as the hierarchy is changed, to stop you from using your old setup (and having TUs added to your previous TMs). If this does happen by accident, you can always export all units last used after the project creation time (i.e. delta TUs). I have also thinned out the number of TMs searched for concordance only. Improved metadata on supervision areas has made concordance-only TMs a lot less useful.
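
    The quality checks from step 1 of this section can be prototyped on exported source/target pairs before anything is deleted. The following Python sketch rests on assumptions of my own: the threshold `MIN_CHARS`, the two regular expressions and both function names are illustrative choices, not settings taken from Trados Studio.

```python
import re
from collections import defaultdict
from typing import Optional

MIN_CHARS = 3  # assumed threshold for "very short"; tune it to your data
NUMERIC_ONLY = re.compile(r"[\d\s.,:;%/()+-]+")
URL_ONLY = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def rejection_reason(source: str, target: str) -> Optional[str]:
    """Return why a TU should be dropped, or None if it passes the checks."""
    src, tgt = source.strip(), target.strip()
    if len(src) < MIN_CHARS:
        return "too short"
    if NUMERIC_ONLY.fullmatch(src):
        return "numeric only"
    if URL_ONLY.fullmatch(src):
        return "URL only"
    if src == tgt:
        return "source equals target"
    return None


def conflicting_targets(units):
    """Map each source text to its targets where more than one target exists."""
    targets = defaultdict(set)
    for source, target in units:
        targets[source.strip()].add(target.strip())
    return {src: tgts for src, tgts in targets.items() if len(tgts) > 1}
```

    Returning a reason rather than a plain yes/no makes it easier to review what would be removed, and the conflicting targets are only listed, since choosing between a pronoun and a noun rendering is a decision for the translator.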

    Revising backup strategies and improving TM/TU governance

    1. Revising my backup strategy: smaller Tier 1 TMs make the backup process shorter, so it is less of a dedicated task and one that can run in the background more easily (e.g. while in a Teams call). Tier 2 TMs are more stable, so they only need backing up after TUs have been added from Tier 1 TMs or old TUs have been moved out into the Tier 3 TM.
    2. Create a set of filters that replicate the backup strategy: for each of your new tiered TMs, you need to create the necessary filters for use with future backups. This ensures that you actually stick to your moving window policy.
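
    The “Delta TM” idea mentioned earlier, i.e. backing up only the TMs that have changed since the last backup, can be approximated for file-based TMs with a short script. This is a minimal sketch under stated assumptions: the function name and folder layout are hypothetical, the TMs are assumed to be `.sdltm` files in a single folder, and a file counts as changed when its modification time is newer than that of its backup copy.

```python
import shutil
from pathlib import Path


def backup_changed_tms(tm_dir: Path, backup_dir: Path,
                       pattern: str = "*.sdltm") -> list[Path]:
    """Copy only the TM files that are new or modified since the last backup."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for tm_file in sorted(tm_dir.glob(pattern)):
        target = backup_dir / tm_file.name
        if not target.exists() or tm_file.stat().st_mtime > target.stat().st_mtime:
            # copy2 keeps the modification time, so an unchanged TM is
            # skipped on the next run.
            shutil.copy2(tm_file, target)
            copied.append(target)
    return copied
```

    A stable Tier 2 TM is then skipped automatically until TUs have been moved into or out of it, whereas a Tier 1 TM in daily use is copied on every run.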

    Improved Governance and the QuickFix

    1. Improve TM and TU governance: the moving window policy also improves TM governance. I now have a clearer picture of the specific supervisory domain at TU level. My Tier 1 TM for Banking Supervision also covers deposit guarantee supervision, payment institution supervision and electronic money institution supervision (i.e. national transpositions of CRR/CRD, DGSD, PSD2 and EMD (and soon PSD3/PSR)), as well as early intervention and crisis management (BRRD). Similarly, I now have set review and archival cycles.
    2. Least time-intensive option: this is possibly the TL;DR version of this post. If you currently use one TM, maybe consider an Active TM and a Reference TM (= Tier 1 and Tier 2/3). Similarly, instead of a 3-year cut-off from Tier 1 to Tier 2, you could opt for a 5-year cut-off. You could also then reduce the backup frequency for your non-Active TM.