Why sargonsays was created
“How do you preserve a dying language? Why even bother and what’s the point?” These questions reverberate in my mind, incessantly ringing like a phone that no one is picking up.
I speak an esoteric language, one that is on the brink of extinction yet very much capable of bouncing back.
ISIS burns 8,000 rare books and manuscripts from the Mosul, Iraq library.
These include Assyrian / Syriac texts. In 2 months, ISIS will attack and destroy the Temple of Bel at the World Heritage Site in Palmyra, Syria.
Oldest monastery in Iraq demolished. In addition to blowing up Assyrian Christian churches, terror group attacks countless Iraqi mosques the same year.
Terror group destroys a 3,000-year-old Assyrian statue in Iraq, attempting to rewrite history. They smash lamassu statues while bulldozing palaces in the city of Nimrud. They take care not to destroy everything on the historical site because they see monetary value in selling these historical artifacts.
If one can’t answer the question of why bother to preserve a dying language, one may approach the question via the inverse – Why NOT destroy a dying language?
In other words, one can ponder what would motivate a group like ISIS to attempt to destroy a language’s artifacts? The thought pattern applied here doesn’t require an answer to the original question. Rather, the principle here indicates the presence of a supporting answer to the original question by nature of a strong inverse.
Armed and obsessed with the vision of preserving my Assyrian language for generations, I yearn to evolve a simple idea… an online English to Assyrian dictionary with high fidelity search results.
Primary Customer and Critical Feature
Making the service accessible to both native and non-native Assyrian speakers is important, and the non-native speakers are arguably the more important group, since they are the customers who wish to learn the language and who bring network effects. But it’s not THE feature. Creating a responsive ReactJS web app with an intuitive user experience is of extremely high priority, but it’s not THE feature.
The most critical feature is search results ranking from the data corpus.
If you search for the color “red” and the FIRST result is not ‘smu-qa’, the Assyrian word for red, then this English to Assyrian effort will be in vain.
Step 1: Contact maintainers of Assyrian data corpuses and see whether either extending their user interface or scraping their data is possible
After contacting the maintainer, who works in France, of a dictionary site for the Assyrian Eastern dialect (one of the 2 dialects I need to support), we conclude that modifying his dictionary’s UI is possible, but also that rearchitecting the data corpus to optimize search result rankings is a herculean effort. The juice may not be worth the squeeze. The maintainer of the Western dialect corpus only has data for the Western dialect, but his organization’s search rankings are quite good. I determine “good” via the eye test: I randomly select from the 25,000 most commonly used English words and look at the results of 100 searches on both the Eastern and Western dialect sites.
Eastern dialect site = poor search result rankings. Search for the word “red” and the correct result ‘smu-qa’ does not appear on the 1st 10 pages of search results. The first result is the word for “predecessor” because it contains the characters “r” “e” “d”.
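The “predecessor” result is what plain substring matching tends to produce. The minimal Python sketch below contrasts substring matching with whole-word matching. It assumes a toy list of English headwords; the entries and function names are hypothetical and are not taken from either site’s code or data.

```python
import re

# Hypothetical toy headwords; not taken from either dictionary site.
ENTRIES = ["predecessor", "hundred", "red", "red wine", "credit"]

def substring_matches(query, entries):
    """Return every entry that merely contains the query's characters."""
    return [entry for entry in entries if query in entry]

def whole_word_matches(query, entries):
    """Return only entries where the query appears as a standalone word."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b")
    return [entry for entry in entries if pattern.search(entry)]

print(substring_matches("red", ENTRIES))   # includes "predecessor"
print(whole_word_matches("red", ENTRIES))  # only "red" and "red wine"
```

Whole-word matching alone does not rank results, but it removes the most misleading hits before any scoring is applied.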
Western dialect site = good search result rankings.
Result of step 1: Scraped the search results from both site corpuses via the Python Scrapy crawling tool, looping through the list of the 25,000 most commonly used English words.
Step 2: Implement high fidelity search rankings via text analysis and relevance scoring.
Java and Lucene to the rescue! A Java string similarity library as well as Lucene indexing prove to be vital tools in statically injecting search ranking metadata back into the new corpus.
One technique of the scoring involves string similarity on the UTF-encoded Assyrian Western dialect text from both sites.
Site 1 versus Site 2 scrape?
Site 1 sometimes has both Eastern and Western, but poor search rankings. Site 2 only has Western dialect results but pretty good rankings. If a strongly similar Assyrian UTF-encoded Western text appears in the search results for an English word on both sites, then I can rank this search result highly.
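The cross-site check can be sketched roughly as follows. The original work used a Java string similarity library; this sketch swaps in Python’s standard `difflib.SequenceMatcher` purely for illustration. The 0.85 threshold, the function names, and the sample strings (Latin-letter placeholders standing in for Western dialect text) are all assumptions.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(text):
    """Normalize Unicode so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text).strip()

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def corroborated(site1_results, site2_results, threshold=0.85):
    """Return Site 1 results that closely match some Site 2 result,
    paired with their best similarity score, highest score first."""
    ranked = []
    for candidate in site1_results:
        best = max(
            (similarity(candidate, other) for other in site2_results),
            default=0.0,
        )
        if best >= threshold:  # assumed cut-off for "strongly similar"
            ranked.append((candidate, best))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Placeholder strings only; real input would be the scraped Western text
# returned by each site for the same English search word.
site1 = ["qadmoyo", "sumoqo", "rabo"]
site2 = ["sumoqo", "sumoqto"]
print(corroborated(site1, site2))
```

Only the Site 1 result that also appears on Site 2 survives the threshold, which is the signal used here to rank it highly.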
A simplified heuristic works out really well. Via Lucene, place greater importance and weight on the first 3 words of an English definition that are not a definite / indefinite article, a number, or a conjunction.
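A rough Python sketch of this heuristic, without Lucene, might look like the following. The word lists, the boost value of 3.0, and the function names are assumptions for illustration; the real weighting was applied through Lucene.

```python
import re

# Assumed, deliberately small word lists for the sketch.
ARTICLES = {"a", "an", "the"}
CONJUNCTIONS = {"and", "or", "but", "nor"}

def words_of(definition):
    """Lowercase word tokens of an English definition."""
    return re.findall(r"[a-z0-9']+", definition.lower())

def leading_terms(definition, limit=3):
    """First `limit` words that are not an article, number, or conjunction."""
    kept = [
        word for word in words_of(definition)
        if word not in ARTICLES
        and word not in CONJUNCTIONS
        and not word.isdigit()
    ]
    return kept[:limit]

def score(query, definition, boost=3.0):
    """Hypothetical score: boosted if the query is a leading term,
    1.0 if it appears later in the definition, 0.0 otherwise."""
    query = query.lower()
    if query in leading_terms(definition):
        return boost
    if query in words_of(definition):
        return 1.0
    return 0.0

print(leading_terms("1. The red color and a shade of crimson"))
print(score("red", "the color red"))
print(score("red", "a bird with a small red crest"))
```

A definition that leads with the searched word outranks one that only mentions it in passing, which is the behavior the heuristic is after.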
Now that I’ve created my own corpus and injected this data into MongoDB, I’m able to visually design a user interface that honors the small real estate on a user’s mobile device. I’m able to earn trust in 3 ways. 1 – every scraped search result definition (36,000 of the 36,600 total) has a deep link to the scraped source. 2 – every ranked search result (94,400) shows a relevance score for how meaningful that search result is to the user’s search. 3 – no advertisements on the homepage.
Step 4: Learn and deliver
“Good night” -> “I love you” -> “Will you marry me?”
In addition to adding features that keep users more engaged, such as auto-completion and related search terms, I learn even more from the searches that return no results. Users are searching for phrases: “Good night” -> “I love you” -> “Will you marry me?” What starts out as a dictionary, which is a term-to-term search, is now being trusted as a translation engine. I leverage other resources to add a Phrases page to the dictionary. This simple page gets the 2nd most visits on the site. For searched phrases that do not have a result, I now tokenize the phrase and provide a link to the search results for each word in it. This frugality bridges a gap: the site cannot yet perform real-time translation via syntactic treebanks or parse trees, but it still provides a value add that meets the user somewhere in between.
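The per-word fallback for phrases could be sketched like this in Python. The `INDEX` dictionary, the `/search?q=` link format, and the function name are hypothetical stand-ins and do not reflect the site’s actual routes or data.

```python
from urllib.parse import quote

# Hypothetical stand-in for the dictionary index: English term -> result count.
INDEX = {"good": 4, "night": 2, "love": 3, "you": 5}

def per_word_links(phrase, index, base="/search?q="):
    """For a phrase with no direct hit, return one (word, link) pair per
    token; the link is None when that word has no results either."""
    if phrase.lower() in index:
        return []  # the phrase itself has results, so no fallback is needed
    links = []
    for raw in phrase.lower().split():
        token = raw.strip(".,!?\"'")  # drop surrounding punctuation
        if not token:
            continue
        link = (base + quote(token)) if token in index else None
        links.append((token, link))
    return links

print(per_word_links("Good night", INDEX))
print(per_word_links("Will you marry me?", INDEX))
```

Words without any results keep a `None` link, so the page can still show the full phrase while only linking the words it can actually resolve.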
Software engineering is an exciting craft. I’ve been fortunate to incorporate learning and creativity into this innovation by picking up technologies I was not yet familiar with, including ReactJS, Docker, and text indexing, and by really internalizing the customer touch points as a UX designer.
We’ve enabled a language to live on and, in doing so, are encouraging the soul of a culture to pass on to the generations. Sargonsays.com, created December 2016, is a tool that promotes diversity of thought by preserving the Assyrian language via digital means. A means that cannot easily be destroyed by a terrorist group. A means that simplifies user interaction via 1 call to action on the homepage, a big search box and button that renders responsively on mobile, and one that delights via audio playback and written English phonetic pronunciation so that non-Assyrian speakers can have a comfortable experience that feels natural. Assyrian software enthusiasts, including students and professionals, go out of their way to contact me, asking to help with the site and to add new features. Non-Assyrians contact me and let me know that they use the dictionary to feel closer to or even impress their Assyrian partner and family. These moments make life meaningful.