    [Announcement]Open source text prediction input plugin

    Page 3 of 4
    juiceme | # 21 | 2018-11-30, 10:21 | Report

    Originally Posted by ljo View Post
    Yes, I agree the Suomi-24 corpus is the best to start with.
    Wouldn't that be a bit biased... taken from a forum which is full of halfwits banging their heads over marginal topics?
    I am pretty sure we'd get a Tay.ai type prediction engine out of that corpus!


     
    FlyingAntero | # 22 | 2018-12-06, 06:42 | Report

    Originally Posted by juiceme View Post
    Wouldn't that be a bit biased... taken from a forum which is full of halfwits banging their heads over marginal topics?
    I am pretty sure we'd get a Tay.ai type prediction engine out of that corpus!
    I don't know how it will work out. I have downloaded the files and uploaded them to the drive (65 GB). I can share a link if someone wants to try it out. If not, then I might try The National Library's journal's Finnish n-grams by myself, because it is easier that way.


     
    ljo | # 23 | 2018-12-06, 15:49 | Report

    Originally Posted by FlyingAntero View Post
    I don't know how it will work out. I have downloaded the files and uploaded them to the drive (65 GB). I can share a link if someone wants to try it out. If not, then I might try The National Library's journal's Finnish n-grams by myself, because it is easier that way.
    OK. I bought a larger hard drive today since I have been hitting the storage limit over and over for a few weeks. So I could give it a try in a few days when I have migrated to the new drive.


     
    rinigus | # 24 | 2018-12-06, 19:55 | Report

    With such a huge file, we may have to split it into smaller parts. Otherwise RAM will probably become an issue.
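    A minimal sketch of such a split, assuming the corpus is a plain UTF-8 text file with one sentence per line (the actual layout of the downloaded files is not described in this thread). The function name `split_corpus`, the part file naming, and the default part size are hypothetical. It streams the input line by line, so memory use does not grow with the file size:

```python
import os


def split_corpus(src_path, out_dir, lines_per_part=1_000_000):
    """Stream src_path line by line and write numbered part files.

    Only one line is held in memory at a time, so a 65 GB input
    does not need 65 GB of RAM. Returns the list of part paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new part file every lines_per_part lines.
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(part_paths):05d}.txt")
                part_paths.append(path)
                out = open(path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return part_paths
```

    Each part can then be processed separately and the resulting n-gram counts merged afterwards.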


     
    FlyingAntero | # 25 | 2018-12-07, 03:19 | Report

    Originally Posted by ljo View Post
    OK. I bought a larger hard drive today since I have been hitting the storage limit over and over for a few weeks. So I could give it a try in a few days when I have migrated to the new drive.
    Nice! Here are the files:
    • Finnish Corpus


     
    ljo | # 26 | 2018-12-07, 09:07 | Report

    Originally Posted by FlyingAntero View Post
    Nice! Here are the files:
    • Finnish Corpus
    Thanks, I will get on it as soon as my hard drive is replaced.


     
    ljo | # 27 | 2018-12-11, 11:48 | Report

    Originally Posted by ljo View Post
    Thanks, I will get on it as soon as my hard drive is replaced.
    So, now there is something to test. I noticed some hyphenation here and there that felt a bit strange, but most of the words I typed were predicted. And it learns fast, so I can't make the same tests twice ...
    I might need to adjust the dictionary size a bit, but as a non-native speaker I await your opinions before doing anything more for Finnish.
    I will try to find some time to continue working on the hyphenation problems, which are really annoying, at least in Swedish.


     
    FlyingAntero | # 28 | 2018-12-12, 09:29 | Report

    I had time to test it this morning, and it seems to work pretty well after quick testing. I can confirm that there is a hyphenation problem with some words. However, it is not a big problem in normal use, since the issue seems to be linked to compound words. Here are a few examples:
    English: Finnish: my input: text prediction
    • text input: tekstinsyöttö: tekstinsyö: tekstin-syö
    • shoe rack: kenkäteline: kenkäte: kenkä-te
    • (space) alien: avaruusolio: avaruusoli: avaruus-oli
    I think that most Finns write compound words separately (tekstin and syöttö) and remove the space later (if they aren't too lazy). If you do that, the prediction knows those separate words.

    I compared the text prediction with an Android phone, and both predictions worked quite similarly for the most common words. Sometimes the most obvious inflected form is among the last words in the list, but I believe that will improve with use (in Sailfish).

    Also, the prediction knows every bad word in Finnish and some name-calling slang words. I believe that is not a surprise, since the corpus came from a forum.

    EDIT: And I almost forgot: huge thanks to you, tusen tack!


     
    rinigus | # 29 | 2018-12-12, 10:03 | Report

    Profanity is an issue, and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been compiled somewhere...
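    A rough sketch of that filtering, assuming the n-grams are held as a Python dict mapping token tuples to counts (the real database format is not described in this thread, so this layout is only an assumption). The helper names and the mild English placeholder word are hypothetical:

```python
def is_blocked(token, blocked_substrings):
    """Return True if the lowercased token contains any blocked substring."""
    token = token.lower()
    return any(sub in token for sub in blocked_substrings)


def filter_ngrams(ngrams, blocked_substrings):
    """Drop every n-gram in which at least one token is blocked.

    ngrams: dict mapping a tuple of tokens to its count (assumed layout).
    blocked_substrings: iterable of strings supplied by native speakers.
    """
    # Normalize once; skip empty strings, which would match everything.
    blocked = [s.lower() for s in blocked_substrings if s]
    return {
        ngram: count
        for ngram, count in ngrams.items()
        if not any(is_blocked(tok, blocked) for tok in ngram)
    }


# Placeholder data; "darn" stands in for a real blocklist entry.
example = {
    ("what", "a", "darn", "day"): 3,
    ("a", "nice", "day"): 7,
    ("Darned", "thing"): 2,
}
cleaned = filter_ngrams(example, ["darn"])
# cleaned == {("a", "nice", "day"): 7}
```

    Substring matching also catches inflected forms, but it can remove harmless words that merely contain a blocked string, so the list would need review by native speakers.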


     
    FlyingAntero | # 30 | 2018-12-12, 10:39 | Report

    Originally Posted by rinigus View Post
    Profanity is an issue, and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been compiled somewhere...
    I can try to find that kind of list or make it myself. Should the list also include every inflected form of a given word? Finnish words have dozens of inflected forms. Here are a few examples:
    Word: run = juosta
    • I run = Minä juoksen
    • You run = Sinä juokset
    • He/she runs = Hän juoksee
    Word: box = laatikko
    • The color of a box = Laatikon väri
    • Look at that box = Katso tuota laatikkoa
    • The cat went inside the box = Kissa meni laatikkoon


     