Tag: European Language Data Space

    7 thoughts: takeaways from the European Language Data Space Workshop

    On Monday 15th September, I was a panelist and participant at the country workshop for Austria on the European Language Data Space. I must confess that I went into the day relatively uninformed about the project, so I greatly appreciated the introduction to the European LDS. It was also good for me to take off my “Aufsichtsbrille” (my supervisory glasses) and understand what possibilities lie beyond my small corner.

    1. Language data is so much more than text-based corpora: while the initial training of LLMs used massive text-based corpora, language data can be so much more than text. With Speech-to-Text and Text-to-Speech being increasingly frequent uses, there are naturally massive audio language data files. My voice alarms’ female British English accent probably primed me to think of only a very small number of voices covering a locale.
      However, particularly for training speech-to-text applications, you need vast audio files covering massive ranges of accents, dialects, ages and pitches of voice. New text-based language datasets are increasingly made up of synthetic data. This led me to wonder about the adequate labeling of datasets.
    2. Getting the language data in is comparatively easy, but the training stage is time-consuming and expensive. Even if the language data is readily available at a low cost, that is only part of the story. The training itself is time-consuming – cases of days, weeks and months were mentioned – and the compute power is not cheap. And then if, say, a tiny part of the data is removed and must no longer be used in the model, how do you get it out?
      After all, retraining on a new “clean” dataset means more weeks/months of training and compute time. This issue is why I am not surprised that Anthropic et al. are choosing billion-dollar settlements in legal cases rather than the cost of retraining and/or the hard work of “getting the toothpaste back in the tube”.
    3. Monetization of language data in Europe is a new concept: the BigTechs have also got around some of the issues surrounding data scraping by setting up deals with platforms that allow them to use their data for AI training. Google’s $60m-a-year deal with Reddit in February 2024 was cited on several occasions. As I put these thoughts into text, there are stories breaking of even bigger/closer ties.
      One intervention of mine during the panel was that monetization might present problems for authorities: where supervisory authorities are funded by supervisory fees, the supervised entities might not be happy that the data created in the course of their supervision is monetized and only made available to them for a fee.
    4. Europe does things differently: Europe thinks in hardware, software and innovation cycles that are still remarkably slow compared to the US or Asia. The LLM/GenAI age, with its increasing frequency of and shortening intervals between releases, shows these cycles to be too long. Maybe this is why there are no European BigTech players. Europe might be falling behind due to over-engineering. To use a language analogy, Europe is a “guardian of usage”, stifling rather than driving innovation, possibly due to heavier-touch regulation.
      However, at the same time, Europe’s approach is more one of “for the greater good” rather than the personal enrichment of a select few. The literary translation keynote given by Anton Hur at the FIT World Congress in Geneva touched on how Silicon Valley specialises in vaporware.
    5. Metadata plays a massive role: when I started using a CAT tool, it was as simple as having source and target languages in TUs. However, since I started at the FMA, I have really understood the need for additional metadata – e.g. to indicate the supervisory area, locales, various usage fields etc. (see the sketch after this list for what such a TU might look like). The volume of data is not the only indicator of the value of that data. BigTech always goes for the “big(ger) is (more) beautiful” approach regarding the amount of data used to train LLMs. So yes, small (with good metadata) can be just as beautiful.
    6. “Getting the toothpaste back in the tube” – how do you remove data from models? One of the biggest asymmetries is the difficulty of removing data from a model without retraining it, compared with how easily data goes into the model. It really is a black box. Similarly, with language constantly changing, there is always the issue that data that is current now does not always remain current, and therefore needs replacing. It would be interesting to understand the breakdown between the training and production costs of models, to understand how big the asymmetry is (I can only assume it is a large one!).
    7. I’ve always understood the value of data, but not been able to value it: the platform for monetizing data really brought that value home – and reinforced that there is also real value in well-curated data (as I see from my work with translation memories and termbases). The “fuzziness” of synthetic data also brings home the real value that lies in human-curated data.
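
    To make the metadata point (5) a little more concrete, here is a minimal sketch in Python of what a metadata-rich translation unit could look like. The field names (supervisory_area, usage_fields, is_synthetic) are my own illustrative assumptions, not the schema of any particular CAT tool, TMX export or of the Language Data Space itself.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class TranslationUnit:
        """A single translation unit (TU) carrying descriptive metadata.

        The metadata fields below are illustrative stand-ins, not the schema
        of any real CAT tool or Language Data Space dataset.
        """
        source_text: str
        target_text: str
        source_locale: str                  # e.g. "de-AT"
        target_locale: str                  # e.g. "en-GB"
        supervisory_area: str = ""          # e.g. "banking", "insurance"
        usage_fields: list = field(default_factory=list)  # e.g. ["annual report"]
        is_synthetic: bool = False          # machine-generated vs. human-curated

    # A tiny, well-labelled TU: not much data, but plenty of context.
    tu = TranslationUnit(
        source_text="Aufsichtsgebühren",
        target_text="supervisory fees",
        source_locale="de-AT",
        target_locale="en-GB",
        supervisory_area="cross-sectoral",
        usage_fields=["terminology", "fee regulation"],
    )
    ```

    Even a TU this small becomes far easier to find and reuse once the supervisory area, locales and usage fields travel with it – which is exactly the “small (with good metadata) can be just as beautiful” argument from point 5.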