This reflective piece journals my internal dialogue and motivations for creating sargonsays, the online English to Assyrian Dictionary.
“How do you preserve a dying language? Why even bother, and what’s the point?” These questions reverberate in my mind, incessantly ringing like a phone that no one is picking up.
I speak an esoteric language, one that is on the brink of extinction yet very much capable of bouncing back.
Feb 2015… ISIS (Islamic State of Iraq and Syria) burns 8,000 rare books and manuscripts from the library in Mosul, Iraq. These include Syriac / Assyrian texts. Later that year, ISIS will attack and destroy the Temple of Bel at the World Heritage Site in Palmyra, Syria. https://bit.ly/307N5Q9
September 2014… The oldest monastery in Iraq is demolished. In addition to blowing up Assyrian Christian churches, ISIS also attacks countless Iraqi mosques the same year.
May 2014… ISIS destroys a 3,000-year-old Assyrian statue in Iraq, attempting to rewrite history. ISIS smashes lamassu statues while bulldozing palaces in the city of Nimrud. They take care not to destroy everything on the historical site because they see the monetary value in selling these historical artifacts.
If one can’t answer the question of why bother to preserve a dying language, one may approach it via the inverse: why destroy a language? In other words, what would motivate a group like ISIS to attempt to destroy a language’s artifacts? This line of thought doesn’t require a direct answer to the original question. Rather, if a group sees enough value in a language to work at erasing it, that same value is a reason to preserve it.
Armed and obsessed with the vision of preserving my Assyrian language for successive generations, I yearn to evolve a simple idea… an online English to Assyrian dictionary with high fidelity search results.
Primary Customer and Critical Feature
Making the service accessible to both native and non-native Assyrian speakers is extremely important, but it’s not THE feature. Non-native speakers are perhaps the more important group, as these customers are the ones wanting to learn the language. Creating a ReactJS responsive web app with an intuitive user experience is not going to be an issue, though it will be of extremely high priority.
The most critical feature is the ranking of search results from the data corpus. If you search for the color “red” and the first result is not ‘smu-qa’, or red in Assyrian, then this English to Assyrian dictionary will be in vain.
Step 1: Contact maintainers of Assyrian data corpuses and see if either extending their user interface or scraping their data is possible. There are 2 dialects I need to support, Eastern and Western. I contact the maintainer, working in France, of a dictionary site for the Eastern dialect. We conclude that modifying his dictionary’s UI is possible but also that rearchitecting the data corpus to optimize for search result rankings is a herculean effort. This juice may not be worth the squeeze. The maintainer of the second corpus only has data for the Western dialect, but his organization’s search rankings are quite good. I determine good via the eye test: I randomly select from the 25,000 most commonly used English words and look at the results of 100 searches on both the Eastern and Western dialect sites.
Eastern dialect site = poor search result rankings. Search for the word “red” and the correct result ‘smu-qa’ does not appear on the first 10 pages of search results. The first result is the word for “predecessor” because it contains the characters “r” “e” “d”.
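To make the “red” versus “predecessor” problem concrete, here is a minimal sketch of how a plain substring search behaves and how a whole-word check reorders the same hits. It is not the code behind either dictionary site. The entry list and the function names are hypothetical, and only the “red” and “predecessor” pair comes from the observation above.

```python
import re

# Hypothetical English glosses, listed in corpus order (illustrative only).
ENTRIES = ["predecessor", "hundred", "red", "red wine", "to turn red"]


def substring_matches(query, entries):
    """Naive search: keep every entry that contains the query's characters."""
    return [e for e in entries if query in e]


def word_rank(query, entry):
    """Lower is better: 0 = exact entry, 1 = whole-word match, 2 = substring only."""
    if entry == query:
        return 0
    if re.search(rf"\b{re.escape(query)}\b", entry):
        return 1
    return 2


def ranked_matches(query, entries):
    """Same hits as the naive search, reordered so whole-word matches come first."""
    hits = substring_matches(query, entries)
    # sorted() is stable, so ties keep their original corpus order.
    return sorted(hits, key=lambda e: word_rank(query, e))
```

With this toy list, `substring_matches("red", ENTRIES)` still leads with "predecessor", while `ranked_matches("red", ENTRIES)` puts "red" first and pushes the substring-only hits to the end.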
Western dialect site = good search result rankings.
Result of step 1: Scrape the search results from both sites’ corpuses via the Python Scrapy crawling tool, looping through the list of the 25,000 most commonly used English words.
Step 2: Implement high fidelity search rankings via text analysis and relevance scoring – simplify wherever possible. Java and Lucene to the rescue! A Java string similarity library as well as Lucene indexing prove to be vital tools in statically injecting search ranking metadata back into the new corpus.
One scoring technique involves string similarity on the UTF-encoded Assyrian Western dialect text from both sites. Site 1 sometimes has both Eastern and Western results, but poor search rankings. Site 2 only has Western dialect results but pretty good rankings. If a strongly similar Assyrian UTF-encoded Western text appears in the search results for an English word on both sites, then I can rank this search result highly.
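The cross-site agreement idea can be sketched roughly as follows. This is an illustration only. It swaps in Python’s standard `difflib.SequenceMatcher` for the Java string similarity library. The 0.85 threshold, the function names, and the sample strings are assumptions. The real comparison runs on Assyrian script text, and the Latin transliterations in the usage note are stand-ins.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Normalize to NFC so equivalent Unicode sequences compare as equal."""
    return unicodedata.normalize("NFC", text).strip()


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cross_site_boost(site1_results, site2_results, threshold=0.85):
    """Map each site-1 result to its best site-2 similarity, keeping only
    results whose best match clears the (assumed) threshold."""
    boosted = {}
    for r1 in site1_results:
        best = max((similarity(r1, r2) for r2 in site2_results), default=0.0)
        if best >= threshold:
            boosted[r1] = best
    return boosted
```

For example, `cross_site_boost(["smu-qa", "zzz-placeholder"], ["smuqa"])` keeps only "smu-qa", because it is the only site-1 result with a close counterpart on site 2. A result that survives this filter is a candidate for a high rank.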
A simplified heuristic works out really well. Place greater importance and weight via Lucene on the first 3 words of an English definition that are not a definite or indefinite article, a number, or a conjunction.
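As a rough illustration of this heuristic, the sketch below picks out the first 3 significant words of a definition and gives a query a higher score when it lands among them. It is plain Python rather than Lucene, and the word lists, weights, and function names are all assumptions made for the example. It also reads the heuristic as “skip articles, numbers, and conjunctions, then take the next 3 words”, which is one interpretation.

```python
import re

# Assumed stop lists for the sketch; a real list would likely be longer.
ARTICLES = {"a", "an", "the"}
CONJUNCTIONS = {"and", "or", "but", "nor"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def leading_terms(definition, limit=3):
    """First `limit` words that are not an article, a number, or a conjunction."""
    kept = [
        w for w in tokenize(definition)
        if w not in ARTICLES and w not in CONJUNCTIONS and not w.isdigit()
    ]
    return kept[:limit]


def score(query, definition, lead_weight=3.0, base_weight=1.0):
    """Weight a match in the leading terms above a match elsewhere."""
    q = query.lower()
    if q in leading_terms(definition):
        return lead_weight
    if q in tokenize(definition):
        return base_weight
    return 0.0
```

Under these assumptions, `score("red", "the color red")` returns the higher weight because "red" is among the leading terms, while a definition that only mentions "red" near its end gets the base weight.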
Step 3: Delight the user with an experience that indicates trust
Now that I’ve created my own corpus and injected this data into MongoDB, I’m able to visually design a user interface that honors the small real estate on a user’s mobile device. I’m able to earn trust in 3 ways. 1 – every scraped search result definition (36,000 of the 36,600 total) has a deep link to the scraped source. 2 – every ranked search result (94,400) shows a relevance score for how meaningful that search result is to the user’s search. 3 – no advertisements on the homepage.
Step 4: Learn and deliver
In addition to adding features that keep users more engaged, such as auto-completion and related search terms, I learn even more from the searches that return no results. Users are searching for phrases. “Good night” -> “I love you” -> “Will you marry me?”. What starts out as a dictionary, which is a term to term search, is now being trusted as a translation engine. I leverage other resources to add a Phrases page to the dictionary. This simple page gets the 2nd most visits on the site. For search phrases that do not have a result, I now tokenize the phrase into words and provide a link to the search results of each word. This frugality bridges the gap of not yet being able to perform a real-time translation via syntactic treebanks or parse trees, while still providing a value add that meets the user somewhere in between.
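The phrase fallback can be sketched in a few lines. This is a minimal illustration and not the site’s code. The `lookup` callable, the `example.com` URL pattern, and the function name are hypothetical.

```python
import re
from urllib.parse import quote


def fallback_links(phrase, lookup, base_url="https://example.com/search/"):
    """Return [] if the phrase itself has a result; otherwise return one
    (word, link) pair per distinct word, in the order the words appear."""
    if lookup(phrase):
        return []
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", phrase.lower())
    seen = set()
    links = []
    for word in words:
        if word not in seen:  # avoid duplicate links for repeated words
            seen.add(word)
            links.append((word, base_url + quote(word)))
    return links
```

If `lookup` knows “good night” but not “Will you marry me?”, the first phrase yields no fallback links, and the second yields four links for "will", "you", "marry", and "me".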
Software engineering is an exciting craft. I’ve been fortunate to incorporate learning and creativity into this innovation by picking up technologies I was not yet familiar with, including ReactJS, Docker, and text indexing, and by really internalizing the customer touch points as a UX designer.
We’ve enabled a language to live on and, in doing so, are encouraging the soul of a culture to pass on to future generations. Sargonsays.com, created December 2016, is a tool that promotes diversity of thought by preserving the Assyrian language via digital means. A means that cannot easily be destroyed by a terrorist group. A means that simplifies user interaction via 1 call to action on the homepage, a big search box and button that renders responsively on mobile, and one that delights via audio playback and written English phonetic pronunciation so that non-Assyrian speakers can have a comfortable experience that feels natural. Assyrian software enthusiasts, including students and professionals, go out of their way to contact me, asking to help with the site and to add new features. Non-Assyrians contact me and let me know that they use the dictionary to feel closer to or even impress their Assyrian partner and family. These moments make life meaningful.