
    [Announcement] Open source text prediction input plugin

    Page 2 of 4 | Prev |   1   2   3     4   | Next
    rinigus | # 11 | 2018-11-28, 16:31 | Report

    Originally Posted by FlyingAntero View Post
    I installed https://openrepos.net/content/sailfi...nput-predictor on my X Compact (using the official patched image from the Xperia X) and it is working like a charm. However, Swedish is my second language. Can anyone help to make a layout for Finnish?

    I have found a database of Finnish words in UTF-8 format on GitHub:
    • https://github.com/hugovk/everyfinnishword
    It would be great to get Finnish on board. What you need is not a list of words but a large body of Finnish text, called a text corpus (https://en.wikipedia.org/wiki/Text_corpus). This is because we want to teach the predictor how to "predict", and that can be done if you know the common word sequences in the language. It works for Estonian, so it should work for Finnish too.

    You may need to contact a language institute to get such a body of text. For Estonian, I managed to get a large text corpus - about 1900GB of text. But a smaller corpus would probably give a decent result too.

    Please look into it - would be great to extend the support to Finnish.
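
    To illustrate why a corpus is needed rather than a word list, here is a minimal sketch of learning "common sequences" by counting word pairs (bigrams). It is a toy illustration only, not the plugin's actual training code; the function names and the sample sentence are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev][nxt] += 1
    return followers

def predict(followers, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    counts = followers.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(k)]

# Hypothetical miniature "corpus"; a real one would be gigabytes of running text.
corpus = "hyvää huomenta kaikille . hyvää yötä . hyvää huomenta taas"
model = train_bigrams(corpus)
print(predict(model, "hyvää"))
```

    A plain word list has no word order in it, so there would be nothing to count here. That is why running text (or ready-made n-gram counts) is required.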

    Also, the layout is the same for Finnish and Swedish. Is it possible to just change the database?

    The Following 3 Users Say Thank You to rinigus For This Useful Post:
    carlosgonz, FlyingAntero, juiceme

     
    ljo | # 12 | 2018-11-28, 19:35 | Report

    Originally Posted by rinigus View Post
    It would be great to get Finnish on board. What you need is not a list of words but a large body of Finnish text, called a text corpus ...
    Please look into it - would be great to extend the support to Finnish.

    Also, the layout is the same for Finnish and Swedish. Is it possible to just change the database?
    I could probably help @FlyingAntero achieve this for Finnish, the same way I created the Swedish resources. As for the last question: yes, basically you could swap out the database, but in the long run it will be easier to build the full package now, so that language-specific support and language switching work correctly.

    The Following 3 Users Say Thank You to ljo For This Useful Post:
    FlyingAntero, juiceme, rinigus

     
    FlyingAntero | # 13 | 2018-11-29, 06:06 | Report

    Originally Posted by rinigus View Post
    It would be great to get Finnish on board. What you need is not a list of words but a large body of Finnish text, called a text corpus (https://en.wikipedia.org/wiki/Text_corpus). This is because we want to teach the predictor how to "predict", and that can be done if you know the common word sequences in the language. It works for Estonian, so it should work for Finnish too.

    You may need to contact a language institute to get such a body of text. For Estonian, I managed to get a large text corpus - about 1900GB of text. But a smaller corpus would probably give a decent result too.

    Please look into it - would be great to extend the support to Finnish.
    OK, now I get it. A text corpus makes sense for prediction. I have access to text corpus data of about 60 GB. Is that enough, or should I try to find a bigger one? The data I have found is split across several zip files: two 25 GB zip files and a few smaller ones. Is that a problem?

    EDIT: Here is more information about the data. It is in VRT file format:
    • https://www.kielipankki.fi/developme...-input-format/

    Originally Posted by ljo View Post
    I could probably help @FlyingAntero to achieve this for Finnish like I created the Swedish resources. For the last question - yes, basically you could switch out the database, but in the long run it will be easier to do the full package now to get the language specific support and switching correct.
    I would be really grateful for help since my programming skills are very limited.


    Last edited by FlyingAntero; 2018-11-29 at 06:15.
    The Following 2 Users Say Thank You to FlyingAntero For This Useful Post:
    juiceme, rinigus

     
    ljo | # 14 | 2018-11-29, 08:12 | Report

    Originally Posted by FlyingAntero View Post
    OK, now I get it. A text corpus makes sense for prediction. I have access to text corpus data of about 60 GB. Is that enough, or should I try to find a bigger one? The data I have found is split across several zip files.

    EDIT: Here is more information about the data. It is in VRT file format:
    • https://www.kielipankki.fi/developme...-input-format/
    I would be really grateful for help since my programming skills are very limited.
    That sounds like a good start. Then we can see if we need more data from our partner Kielipankki. Make sure to include some social media resources too in the first batch. Multiple source files are no problem. Concatenate them if a single source file feels easier for you to handle.
    No problem, happy to help.
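
    If a single file is easier to handle, concatenating the contents of several zip archives could look roughly like the sketch below. It is only a sketch: the function name and the archive names are hypothetical, and it assumes every archive member is a text file that should be appended as-is.

```python
import shutil
import zipfile

def concatenate_zip_members(zip_paths, out_path):
    """Append every file inside each zip archive to one output file.

    Members are streamed, so archives larger than RAM are fine.
    Returns the number of files appended.
    """
    count = 0
    with open(out_path, "wb") as out:
        for zip_path in zip_paths:
            with zipfile.ZipFile(zip_path) as archive:
                for info in archive.infolist():
                    if info.is_dir():
                        continue
                    with archive.open(info) as member:
                        shutil.copyfileobj(member, out)
                    # Separate files with a newline so the last token of one
                    # file does not merge with the first token of the next.
                    out.write(b"\n")
                    count += 1
    return count

# Example call with hypothetical archive names:
# concatenate_zip_members(["part1.zip", "part2.zip"], "corpus_all.txt")
```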

    The Following 3 Users Say Thank You to ljo For This Useful Post:
    FlyingAntero, juiceme, rinigus

     
    rinigus | # 15 | 2018-11-29, 08:21 | Report

    From a brief reading, it looks like VRT is not plain text but a processed list of tokens. Training needs either plain text or data already processed into n-grams (the latter was used for Russian). But there should be a text corpus behind these processed files as well.

    The Following 2 Users Say Thank You to rinigus For This Useful Post:
    FlyingAntero, juiceme

     
    ljo | # 16 | 2018-11-29, 09:42 | Report

    Originally Posted by rinigus View Post
    From a brief reading, it looks like VRT is not plain text but a processed list of tokens. Training needs either plain text or data already processed into n-grams (the latter was used for Russian). But there should be a text corpus behind these processed files as well.
    There is nothing wrong with using the VRT files. They just need to be processed to extract the running text tokens. So if we do this first for what you have, we will see whether further data is needed (probably, since our annotated VRT files carry a lot of annotations that add to the file sizes).
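
    Extracting the running text from VRT could be sketched as below. The sketch assumes the usual verticalized layout: one token per line, tab-separated annotation columns with the word form in the first column, and XML-like structural tags such as <sentence> on their own lines. The tag name, the column order, and the sample data are assumptions; check them against the actual files before relying on this.

```python
import html

def vrt_to_sentences(lines):
    """Yield one plain-text sentence per <sentence> element in a VRT stream."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            # Structural tag: emit the collected tokens when a sentence closes.
            if line.startswith("</sentence") and tokens:
                yield " ".join(tokens)
                tokens = []
            continue
        # Token line: keep only the first column (assumed to be the word form)
        # and undo XML escaping such as &amp; or &lt;.
        tokens.append(html.unescape(line.split("\t", 1)[0]))
    if tokens:
        # Tokens left over when the input has no closing sentence tag.
        yield " ".join(tokens)

# Hypothetical three-column sample (word form, lemma, part of speech).
sample = [
    '<text title="esimerkki">\n',
    "<sentence>\n",
    "Hyvää\thyvä\tA\n",
    "huomenta\thuomen\tN\n",
    "!\t!\tPunct\n",
    "</sentence>\n",
    "</text>\n",
]
print(list(vrt_to_sentences(sample)))
```

    The annotation columns are dropped entirely, which is why the extracted text ends up much smaller than the VRT files.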

    The Following 3 Users Say Thank You to ljo For This Useful Post:
    FlyingAntero, juiceme, rinigus

     
    rinigus | # 17 | 2018-11-29, 10:35 | Report

    @ljo: sounds great! Good luck with it!

    The Following 2 Users Say Thank You to rinigus For This Useful Post:
    FlyingAntero, juiceme

     
    FlyingAntero | # 18 | 2018-11-29, 10:48 | Report

    I took a closer look at the data that is available from Kielipankki:
    • The Suomi 24 Corpus, ~60 GB: the largest discussion forum in Finland
    • FNC1, ~30 GB: Finnish n-grams from the National Library's journals (1820-2000)
    • DSPCON, ~4 GB: Aalto University DSP Course Conversation Corpus
    • AMPH, 600 MB: Think, ponder, consider -corpus
    • The SFNET corpus, 400 MB: a fairly small discussion forum
    • Ylilauta Corpus, 300 MB: a Finnish version of 4chan
    • Opusparcus, 265 MB: Open Subtitles Paraphrase
    • Psycholinguistic corpus, 65 MB: Psycholinguistic Descriptives
    • Morphologies, 50 MB: Morphologies

    In addition, there is a corpus data set covering several Finnish magazines and newspapers from the 1990s and 2000s (around 300 magazines). However, I downloaded three of them that deal with tech, and each magazine was only ~1 MB in size. You also have to download each of them separately.

    EDIT: Some of them are in VRT format and others in TXT format.


    Last edited by FlyingAntero; 2018-11-29 at 10:54.
    The Following 3 Users Say Thank You to FlyingAntero For This Useful Post:
    juiceme, MartinK, rinigus

     
    rinigus | # 19 | 2018-11-29, 15:58 | Report

    FNC1 may allow you to cut corners and get it running without computing any statistics yourself, since the n-gram counting is already done for you. However, the language may have changed over that time window. Otherwise, the first one is probably of the biggest interest.

    The Following 2 Users Say Thank You to rinigus For This Useful Post:
    FlyingAntero, juiceme

     
    ljo | # 20 | 2018-11-30, 10:09 | Report

    Originally Posted by rinigus View Post
    FNC1 may allow you to cut corners and get it running without computing any statistics yourself, since the n-gram counting is already done for you. However, the language may have changed over that time window. Otherwise, the first one is probably of the biggest interest.
    Yes, I agree the Suomi-24 corpus is the best to start with.

    The Following 2 Users Say Thank You to ljo For This Useful Post:
    FlyingAntero, juiceme

     