Posts: 36 | Thanked: 118 times | Joined on Nov 2018
#30
Originally Posted by rinigus
Profanity is an issue, and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been compiled somewhere...
I can try to find that kind of list or make one myself. Should the list also include every inflected form of each word? Finnish words have dozens of inflected forms. Here are a few examples:
Word: run = juosta
  • I run = Minä juoksen
  • You run = Sinä juokset
  • He/she runs = Hän juoksee
Word: box = laatikko
  • The color of a box = Laatikon väri
  • Look at that box = Katso tuota laatikkoa
  • The cat went inside the box = Kissa meni laatikkoon
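
To make the substring idea concrete, here is a rough sketch of how a short list of stems might cover several inflected forms at once. It uses the harmless example words above as stand-ins for real "bad" words. The stems and the function names (`BLOCKED_STEMS`, `contains_blocked`, `filter_ngrams`) are my own assumptions for illustration, not anything from the actual database tooling.

```python
# Hypothetical stems, chosen by hand from the harmless example words above.
# A real list would have to come from native speakers.
BLOCKED_STEMS = ["laatik", "juoks"]


def contains_blocked(ngram, stems=BLOCKED_STEMS):
    """Return True if any word of the n-gram contains a blocked stem as a substring."""
    words = ngram.lower().split()
    return any(stem in word for word in words for stem in stems)


def filter_ngrams(ngrams, stems=BLOCKED_STEMS):
    """Keep only the n-grams in which no word matches a blocked stem."""
    return [ng for ng in ngrams if not contains_blocked(ng, stems)]


sample = [
    "Laatikon väri",
    "Katso tuota laatikkoa",
    "Kissa meni laatikkoon",
    "Minä juoksen",
    "Kissa meni kotiin",
    "juosta",
]
kept = filter_ngrams(sample)
print(kept)
```

In this sample, one stem catches all three forms of "laatikko", but the stem "juoks" does not catch the base form "juosta". So substrings would probably shorten the list a lot, but some forms would still need to be listed separately, and short stems could also match unrelated words.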
 

The Following User Says Thank You to FlyingAntero For This Useful Post: