Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia
#29
Profanity is an issue, and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of such words (possibly matched as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been compiled somewhere...
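
A rough sketch of the filtering I have in mind, in Python. It assumes the n-grams are available as (text, count) pairs with words separated by spaces, which may not match the actual database layout. The names filter_ngrams and is_clean are made up for illustration, and "darn" is just a harmless stand-in for a real word list.

```python
def is_clean(ngram, bad_words, match_substrings=False):
    """Return True if no word of the n-gram is classified as bad."""
    words = ngram.lower().split()
    if match_substrings:
        # Substring mode: a bad entry anywhere inside a word disqualifies it.
        return not any(bad in word for word in words for bad in bad_words)
    # Whole-word mode: only exact matches disqualify the n-gram.
    return not any(word in bad_words for word in words)


def filter_ngrams(ngrams, bad_words, match_substrings=False):
    """Keep only the (ngram, count) pairs that contain no bad word.

    The (ngram, count) layout is an assumption made for this sketch.
    """
    bad = {w.lower() for w in bad_words}
    return [
        (ngram, count)
        for ngram, count in ngrams
        if is_clean(ngram, bad, match_substrings)
    ]


# Placeholder data; a real list would come from native speakers.
ngrams = [("what the darn", 12), ("what the weather", 30), ("darned if i", 4)]
bad_words = ["darn"]

whole_word = filter_ngrams(ngrams, bad_words)
substring = filter_ngrams(ngrams, bad_words, match_substrings=True)
```

Substring matching catches inflected forms without listing each one, but it will also drop innocent words that happen to contain a listed string, so the list would need checking by someone who knows the language.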
 

The Following 2 Users Say Thank You to rinigus For This Useful Post: