I’ve implemented automatic message translation from russian to Ukrainian in my Telegram chat. I won’t delve into the details of why this was necessary. Some users were pleased with this feature, while others were annoyed or questioned its purpose. Let’s say I took on this technical challenge for personal growth. At work, I primarily handle organizational tasks that drain my energy, and I’m not exceptionally skilled at or passionate about them.
In other words, I undertook this project to learn new technology and to enjoy acquiring new skills and accomplishing something independently. Additionally, the feature emphasizes that russian is a foreign language in Ukrainian spaces and subtly encourages people to switch to Ukrainian or perhaps even leave the chat altogether. Alright, let’s move on to the technical side.
I have a bot that runs on PHP, the programming language I primarily use. First, I looked for a solution to detect the language of the messages, as I didn’t want to send every message to the translation API. I came across a LanguageDetector library on GitHub (landrok/language-detector). After incorporating it into the bot, I decided to calculate the scores for each language and compare them. It was necessary because the library’s language detection alone often misidentified Cyrillic languages. More extensive datasets may be needed, but I couldn’t find them immediately.
After experimenting and testing with short phrases, I subtracted the Ukrainian score from the russian score. If the difference was greater than 0.016 and the total russian score was above 0.05, the message was considered to be in russian. Eventually, I settled on values of 0.02 and 0.1 to reduce how often translation was triggered for short words and phrases, where the translation wouldn’t be significantly different anyway.
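The rule above can be sketched in a few lines. This is a minimal illustration in Python rather than the bot’s PHP; the function name, the `scores` dictionary and the "ru"/"uk" keys are assumptions made for the example, not the LanguageDetector library’s actual output format.

```python
def looks_russian(scores, min_gap=0.02, min_score=0.1):
    """Return True when the russian score clearly beats the Ukrainian one.

    `scores` is a hypothetical mapping of language code to detector score,
    for example {"ru": 0.31, "uk": 0.27}. The default thresholds are the
    final values mentioned in the text (0.02 and 0.1).
    """
    ru = scores.get("ru", 0.0)
    uk = scores.get("uk", 0.0)
    # Both conditions must hold: a large enough gap and a large enough score.
    return (ru - uk) > min_gap and ru > min_score
```

Requiring both conditions means a short phrase with low scores for every language is left alone even if russian happens to rank slightly higher.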
For the translation itself, I chose DeepL since, to the best of my knowledge, it’s currently the best option on the market. Making requests to the DeepL API for message translation was straightforward. I registered for the free API, which provides half a million characters per month. That should be sufficient for a chat where most messages are already in Ukrainian.
Initially, I let the source language be detected automatically, but later on, I set it explicitly so that the text was always translated from russian. It was necessary because my own language detection didn’t always identify russian correctly and occasionally sent Ukrainian messages for translation. Since those messages were already in Ukrainian, DeepL sometimes detected the input language as another Cyrillic language, such as Bulgarian, which resulted in amusing translation outcomes.
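For illustration, a request with a fixed source language could be built as follows. This is a sketch in Python rather than the bot’s PHP, assuming the free-plan endpoint `https://api-free.deepl.com/v2/translate` and the `text`, `source_lang` and `target_lang` parameters of the DeepL API; the helper name and the way the key is passed in are made up for the example.

```python
import urllib.parse
import urllib.request

# Free-plan endpoint of the DeepL API (paid plans use a different host).
DEEPL_FREE_URL = "https://api-free.deepl.com/v2/translate"


def build_translate_request(text, auth_key):
    """Build (but do not send) a DeepL request that pins the source language.

    Setting source_lang to "RU" skips DeepL's own detection, so a Ukrainian
    message sent by mistake is not treated as, say, Bulgarian.
    """
    data = urllib.parse.urlencode({
        "text": text,
        "source_lang": "RU",  # always translate from russian
        "target_lang": "UK",  # DeepL's code for Ukrainian
    }).encode("utf-8")
    return urllib.request.Request(
        DEEPL_FREE_URL,
        data=data,
        headers={"Authorization": f"DeepL-Auth-Key {auth_key}"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` should return JSON in which the translated text sits under `translations[0].text`; error handling and quota checks are left out of this sketch.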
After a day of testing the translation functionality, I observed that the language detection often failed and sent Ukrainian messages for translation. I attempted to find better solutions but couldn’t find an affordable API. However, while going through the issues on the language detector repository on GitHub, I discovered a recommendation for the Compact Language Detector 2 PHP Extension (fntlnz/cld2-php-ext), which was said to be superior.
At first, I was apprehensive because building the extension from source seemed daunting, and I had run into problems with that in the past. Nevertheless, I decided to give it a try. Unfortunately, the compilation failed, and my attempts to find a solution through online searches were fruitless. However, upon revisiting the repository’s issues section, I learned that the extension was incompatible with PHP 8, but there was a pull request to make it compatible (hiteule/cld2-php-ext/tree/support-php8). Consequently, I downloaded the extension from the PR branch, and it built successfully.
Subsequently, I realized that my bot was running on PHP 7 while the CLI was on PHP 8. I considered switching the PHP version of the CLI, but instead, I decided to enable PHP 8 for the bot. This change should improve performance, and there were no compatibility issues. Throughout this process, I also set up an SSH key on GitHub, which I hadn’t done before. It allowed me to clone repositories via SSH. Fortunately, it was a straightforward process, as I often use SSH for other tasks and have added keys to other services.
The CLD2 language detection algorithm performs much better with complete sentences, correctly identifying the language. However, it struggles to detect the language for short phrases and single words and doesn’t provide any suggestions in such cases. Consequently, I implemented a fallback mechanism to the previous Language Detector in situations where the language couldn’t be detected or when the detection score for russian was less than 97%. I noticed that when the detection score was below 97%, it could still be some broken Ukrainian or a phrase that didn’t require translation.
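The fallback can be sketched as a small decision function. This is an illustration in Python rather than the bot’s PHP; the two detector callables and their return shapes are hypothetical stand-ins, not the real interfaces of the CLD2 extension or the LanguageDetector library. One assumption is labeled in the code: a result for a language other than russian is treated as final.

```python
def is_russian(text, primary_detect, fallback_detect, min_confidence=97):
    """Decide whether `text` should be sent for translation.

    primary_detect(text)  -> (language_code, confidence_percent) or None
                             (stand-in for the CLD2-based detector)
    fallback_detect(text) -> bool
                             (stand-in for the earlier score-based check)
    """
    result = primary_detect(text)
    if result is None:
        # Short phrases and single words: the primary detector gives nothing.
        return fallback_detect(text)

    language, confidence = result
    if language != "ru":
        # Assumption: another detected language means no translation.
        return False
    if confidence >= min_confidence:
        return True
    # Russian detected, but below 97%: ask the previous detector.
    return fallback_detect(text)
```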
Despite these efforts, there were still cases of language detection failure, which meant that a translation was shown even when the text had barely changed or not changed at all. After some experimentation, I devised a solution that involved extracting only the Cyrillic lowercase letters from the original text and the translated text and then comparing the strings of these letters. The translation was not displayed if the strings matched by 85% or more. This approach filtered out cases where the text remained unchanged after translation or had only minor differences. It also addressed the issue with the Ukrainian apostrophe, which is represented differently on various keyboards and systems.
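A minimal sketch of this filter, in Python rather than the bot’s PHP, is shown here. The similarity measure is an assumption: the text does not say how the 85% match was computed, so the example uses `difflib.SequenceMatcher` from the standard library, and the function names are made up.

```python
import re
from difflib import SequenceMatcher

# Lowercase Cyrillic letters used in russian and Ukrainian.
CYRILLIC_LOWER = re.compile(r"[а-яёіїєґ]")


def cyrillic_letters(text):
    """Keep only lowercase Cyrillic letters; drop everything else."""
    return "".join(CYRILLIC_LOWER.findall(text))


def translation_is_redundant(original, translated, threshold=0.85):
    """Return True when the translation barely differs from the original."""
    a = cyrillic_letters(original)
    b = cyrillic_letters(translated)
    # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Because apostrophes, punctuation and spaces are discarded before the comparison, "об'єкт" and "об’єкт" reduce to the same string of letters, so the different apostrophe characters no longer make an unchanged message look translated.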
This project provided an excellent opportunity to re-engage in activities I enjoy. In the past month, I noticed a recurring suggestion to focus on tasks we are passionate about and excel at. The sense of accomplishment I derived from implementing this small functionality surpassed that of any video game completion. Unfortunately, the practical usefulness of this implementation is limited, and despite receiving initial positive feedback, I now face constant negative feedback. I even set the bot to post translations without replying to the initial message. To prevent such situations, I need to become a better product owner, but that’s a challenge I’ll reserve for my work instead of my hobby.