maemo.org - Talk (https://talk.maemo.org/index.php)
-   SailfishOS (https://talk.maemo.org/forumdisplay.php?f=52)
-   -   [Announcement]Open source text prediction input plugin (https://talk.maemo.org/showthread.php?t=100266)

juiceme 2018-11-30 10:21

Re: [Announcement]Open source text prediction input plugin
 
Quote:

Originally Posted by ljo (Post 1551307)
Yes, I agree the Suomi-24 corpus is the best to start with.

Wouldn't that be a bit biased... taken from a forum which is full of halfwits banging their heads on marginal topics?
I am pretty sure we'd get a Tay.ai type prediction engine out of that corpus! :p

FlyingAntero 2018-12-06 06:42

Quote:

Originally Posted by juiceme (Post 1551308)
Wouldn't that be a bit biased... taken from a forum which is full of halfwits banging their heads on marginal topics?
I am pretty sure we'd get a Tay.ai type prediction engine out of that corpus! :p

I don't know how it will work out. I have downloaded the files and uploaded them to the drive (65 GB). I can share a link if someone wants to try it out. If not, I might try the National Library's Finnish journal n-grams myself, because that is easier.

ljo 2018-12-06 15:49

Quote:

Originally Posted by FlyingAntero (Post 1551476)
I don't know how it will work out. I have downloaded the files and uploaded them to the drive (65 GB). I can share a link if someone wants to try it out. If not, I might try the National Library's Finnish journal n-grams myself, because that is easier.

OK. I bought a larger hard drive today since I have been hitting the storage limit over and over for a few weeks. So I could give it a try in a few days when I have migrated to the new drive.

rinigus 2018-12-06 19:55

With such a huge file, we may have to split it into smaller parts. Otherwise RAM will probably become an issue.
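
A minimal sketch of that kind of splitting, assuming the corpus is line-oriented UTF-8 text (the actual layout of the Suomi-24 files is not confirmed here). The function name `split_by_lines`, the `lines_per_part` parameter and the `part-00000.txt` naming are made up for illustration. The file is streamed one line at a time, so memory use stays small regardless of the total size.

```python
import os


def split_by_lines(src_path, out_dir, lines_per_part=1_000_000):
    """Stream src_path line by line and write numbered part files.

    Hypothetical helper: assumes a line-oriented UTF-8 text corpus.
    Returns the list of part file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    # errors="replace" keeps the split going if a few bytes are not valid UTF-8
    with open(src_path, "r", encoding="utf-8", errors="replace") as src:
        for i, line in enumerate(src):
            if i % lines_per_part == 0:
                # Start a new part every lines_per_part lines
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(part_paths):05d}.txt")
                out = open(path, "w", encoding="utf-8")
                part_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return part_paths
```

Each part can then be processed on its own and the resulting n-gram counts merged afterwards.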

FlyingAntero 2018-12-07 03:19

Quote:

Originally Posted by ljo (Post 1551494)
OK. I bought a larger hard drive today since I have been hitting the storage limit over and over for a few weeks. So I could give it a try in a few days when I have migrated to the new drive.

Nice! Here are the files:

ljo 2018-12-07 09:07

Quote:

Originally Posted by FlyingAntero (Post 1551513)
Nice! Here are the files:

Thanks, I will get on it as soon as my hard drive is replaced.

ljo 2018-12-11 11:48

Quote:

Originally Posted by ljo (Post 1551521)
Thanks, I will get on it as soon as my hard drive is replaced.

So, now there is something to test. I noticed some hyphenation here and there that felt a bit strange, but most of the words I typed were predicted. And it learns fast, so I can't make the same tests twice ...
I might need to adjust the dictionary size a bit, but as a non-native speaker I will wait for your opinions before doing anything more for Finnish.
I will try to find some time to continue working on the hyphenation problems, which are really annoying in Swedish at least.

FlyingAntero 2018-12-12 09:29

I had time to test it this morning and it seems to work pretty well after quick testing :). I can confirm that there is a hyphenation problem with some words. However, it is not a big problem in normal use, since the issue seems to be linked to compound words. Here are a few examples:
English | Finnish | my input | text prediction
  • text input | tekstinsyöttö | tekstinsyö | tekstin-syö
  • shoe rack | kenkäteline | kenkäte | kenkä-te
  • (space) alien | avaruusolio | avaruusoli | avaruus-oli
I think that most Finns write compound words separately (tekstin and syöttö) and remove the space later (if they aren't too lazy). If you do that, the prediction knows those separate words.

I compared the text prediction with an Android phone, and both predictions worked quite similarly with the most common words. Sometimes the most obvious conjugation is among the last words in the list, but I believe that will improve with use (in Sailfish).

Also, the prediction knows every bad word in Finnish and some name-calling slang words. I guess that is not a surprise, since the corpus came from a forum.

EDIT: And I almost forgot: huge thanks to you, tusen tack!

rinigus 2018-12-12 10:03

Profanity is an issue and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been compiled somewhere...
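
A rough sketch of that filtering step, assuming the n-grams are available as `(text, count)` pairs; that layout, the function names and the placeholder word `badword` are only for illustration and are not the real database format. Matching is done on case-folded substrings, as suggested above.

```python
def contains_bad_word(ngram, bad_substrings):
    """Return True if any listed substring occurs in the n-gram text."""
    text = ngram.casefold()
    return any(bad in text for bad in bad_substrings)


def filter_ngrams(ngrams, bad_substrings):
    """Yield only the (ngram, count) pairs that contain no listed substring.

    Hypothetical input format: an iterable of (text, count) pairs.
    """
    # Normalise the list once and skip empty entries, which would match everything
    bad = [b.casefold() for b in bad_substrings if b]
    for ngram, count in ngrams:
        if not contains_bad_word(ngram, bad):
            yield ngram, count


# Usage with placeholder data
sample = [("this is fine", 10), ("a BADWORD here", 3), ("badwords everywhere", 2)]
clean = list(filter_ngrams(sample, ["badword"]))
```

Substring matching also drops harmless words that happen to contain a listed string, so the list needs to be chosen with some care.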

FlyingAntero 2018-12-12 10:39

Quote:

Originally Posted by rinigus (Post 1551688)
Profanity is an issue and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been compiled somewhere...

I can try to find that kind of list or make it myself. Should the list also include every conjugation of a specific word? Finnish words have dozens of conjugation forms. Here are a few examples:
Word: run = juosta
  • I run = Minä juoksen
  • You run = Sinä juokset
  • He/she runs = Hän juoksee
Word: box = laatikko
  • The color of a box = Laatikon väri
  • Look at that box = Katso tuota laatikkoa
  • The cat went inside the box = Kissa meni laatikkoon
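
One possible way around listing every form is to match on stems instead, using the harmless words above as stand-ins for the real list. This is only a sketch: the stems and the names `STEMS` and `matches_any_stem` are my own assumptions, not an existing list. Note that juosta needs two stems (juos, juoks) because the stem changes, and a shorter stem like "juo" would also catch unrelated words such as juoda (to drink).

```python
# Hypothetical stem list: each base word maps to the stems its forms start with
STEMS = {
    "juosta": ("juos", "juoks"),   # juosta, juoksen, juokset, juoksee
    "laatikko": ("laatik",),       # laatikko, laatikon, laatikkoa, laatikkoon
}


def matches_any_stem(word, stems):
    """Return True if the case-folded word starts with one of the stems."""
    w = word.casefold()
    return any(w.startswith(stem) for stem in stems)


# Flatten all stems into one tuple for checking arbitrary words
ALL_STEMS = tuple(stem for group in STEMS.values() for stem in group)

forms = ["juoksen", "juokset", "juoksee", "Laatikon", "laatikkoa", "laatikkoon"]
all_forms_matched = all(matches_any_stem(f, ALL_STEMS) for f in forms)
```

With a few stems per word, the list stays short while still covering the forms shown above.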
