On Monday 15th September, I was a panelist and participant at the country workshop for Austria on the European Language Data Space. I’ll have to confess that I went into the day relatively uninformed about the project, so I greatly appreciated the introduction to the European LDS. It was also good for me to take off my “Aufsichtsbrille” (supervisory glasses) and understand what possibilities lie beyond my small corner.
- Language data is so much more than text-based corpora: while the initial training of LLMs used massive text-based corpora, language data can be so much more than text. With speech-to-text and text-to-speech being increasingly frequent uses, there are naturally massive audio language data files. My voice alarms’ female British English accent probably primed me to think only of a very small number of voices covering a locale.
However, particularly for training speech-to-text applications, you need vast audio files covering massive ranges of accents, dialects, ages and pitches of voice. New text-based language data sets are increasingly made up of synthetic data, which led me to wonder about the adequate labeling of datasets.
- Getting the language data in is comparatively easy, but the training stage is time-consuming and expensive: even if the language data is readily available at a low cost, that is only part of the story. The compute required is time-consuming – cases were mentioned of days, weeks and months – and that compute power is not cheap. And then, if say a tiny part of the data is removed and must no longer be used in the model, how do you get it out?
After all, retraining on a new “clean” dataset means more weeks/months of training and compute time. This issue is why I am not surprised that Anthropic et al. are choosing billion-dollar settlements in legal cases rather than the cost of retraining and/or the hard work of “getting the toothpaste back in the tube”.
- Monetization of language data in Europe is a new concept: BigTechs have also got around some of the issues surrounding data scraping by setting up deals with platforms that allow them to use their data for AI training. Google’s $60m-a-year deal with Reddit in February 2024 was cited on several occasions. As I put these thoughts into text, there are stories breaking of even bigger/closer ties.
One intervention of mine during the panel was that monetization might present problems for authorities – for example, where supervisory authorities are funded by supervisory fees, supervised entities might not be happy that the data created in the course of their being supervised is monetized and then only made available to them for a fee.
- Europe does things differently: Europe thinks in hardware, software and innovation cycles that are still remarkably slow compared to the US or Asia. The LLM/GenAI age, with its increasingly frequent releases and shortening intervals between them, shows these cycles to be too long. Maybe this is why there are no European BigTech players. Europe might be falling behind due to over-engineering. To use a language allegory, Europe is a “guardian of usage”, stifling rather than driving innovation, possibly due to heavier-touch regulation.
However, at the same time, Europe’s approach is more one of “for the greater good” rather than the personal enrichment of a select few. The literary translation keynote at the FIT World Congress in Geneva, given by Anton Hur, touched on how Silicon Valley specialises in vaporware.
- Metadata plays a massive role: when I started using a CAT tool, it was as simple as having source and target languages in TUs (translation units). However, since I started at the FMA, I have really understood the need for additional metadata – e.g. to indicate the supervisory area, locales, various usage fields etc. The volume/amount of data is not the only indicator of the value of that data. BigTech always goes for the “big(ger) is (more) beautiful” approach regarding the amount of data used to train LLMs. So yes, small (with good metadata) can be just as beautiful – a minimal sketch of what such richer metadata might look like follows after this list.
- “Getting the toothpaste back in the tube” – how do you remove data from models? One of the biggest asymmetries is how difficult it is to remove data from a model without retraining it, compared with how easy it is to get data into the model. It really is a black box. Similarly, with language constantly changing, there is always the issue that data that is current now does not always remain current, and therefore needs replacing. It would be interesting to understand the breakdown between the training and production costs of models, to understand how big the asymmetry is (I can only assume it is a large one!) – a purely illustrative back-of-the-envelope calculation follows after this list.
- I’ve always understood the value of data, but not been able to value it: the platform for the monetization of data really brought home the value of data – and reinforced that there is also real value in well-curated data (as I see from my work with translation memories and termbases, maintained in my case in MultiTerm). The “fuzziness” of synthetic data also brings home the real value that lies in human-curated data.
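To make the metadata point a little more concrete, here is a minimal sketch (in Python) of what richer translation-unit metadata could look like in a TMX-style file. The property names and example values (“supervisory-area”, “usage-field”, “origin”) are my own hypothetical illustrations, not an actual FMA or TMX-standard vocabulary.

```python
# Minimal sketch: attaching extra metadata to a TMX-style translation unit.
# The prop types and values below are hypothetical illustrations only.
import xml.etree.ElementTree as ET

tu = ET.Element("tu")

# Descriptive metadata beyond plain source/target text
for prop_type, value in [
    ("supervisory-area", "banking supervision"),  # hypothetical label
    ("usage-field", "annual report"),             # hypothetical label
    ("origin", "human-translated"),               # as opposed to "synthetic"
]:
    prop = ET.SubElement(tu, "prop", {"type": prop_type})
    prop.text = value

# The translation unit variants themselves: source and target segments
for lang, text in [("de-AT", "Aufsicht"), ("en-GB", "supervision")]:
    tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
    ET.SubElement(tuv, "seg").text = text

print(ET.tostring(tu, encoding="unicode"))
```

Even a handful of such properties lets a small, well-curated memory be filtered and reused far more precisely than a much larger but unlabelled one.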
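And as a purely illustrative back-of-the-envelope look at the retrain-versus-remove asymmetry: every number below is a made-up placeholder (assumed GPU-hour price, cluster size and training duration), not a figure from the workshop.

```python
# Back-of-the-envelope sketch of the retrain-vs-remove asymmetry.
# Every number here is a made-up placeholder purely for illustration.

GPU_HOUR_COST_EUR = 2.50   # assumed cloud price per GPU-hour
TRAINING_GPUS = 1_000      # assumed cluster size for pre-training
TRAINING_DAYS = 30         # "days, weeks and months" were mentioned

full_retrain_cost = TRAINING_GPUS * TRAINING_DAYS * 24 * GPU_HOUR_COST_EUR

# Removing even one document from the corpus still forces a full retrain
# (absent machine-unlearning techniques), so the marginal "removal" cost
# is effectively the whole retraining bill again.
print(f"One full retraining run: ~EUR {full_retrain_cost:,.0f}")
```

Whatever the real figures are, the point stands: getting a tiny piece of data out costs roughly as much as putting the whole dataset in did.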