Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia
#11
Originally Posted by FlyingAntero View Post
I installed https://openrepos.net/content/sailfi...nput-predictor on my X Compact (using the official patched image from Xperia X) and it is working like a charm. However, Swedish is my second language. Can anyone help make a layout for Finnish?

I have found a database of Finnish words in UTF-8 format on GitHub:
Would be great to get Finnish on board. What you need is not a list of words but a large body of Finnish text, called a text corpus (https://en.wikipedia.org/wiki/Text_corpus). This is because we want to teach the predictor how to "predict", and that can be done if you know the common word sequences in the language. It works for Estonian, so it should work for Finnish too.

You may need to contact a language institute to get such a body of text. For Estonian, I managed to get a large text corpus, about 1900GB of text. A smaller corpus would probably still give a decent result.

Please look into it - would be great to extend the support to Finnish.
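
To illustrate why running text is needed rather than a word list, here is a minimal sketch of learning "common sequences" by counting which word follows which. It is only an illustration under simple assumptions (whitespace tokenization, bigrams only); the function names `count_bigrams` and `predict` are made up for this example and are not how the predictor is actually trained.

```python
from collections import Counter, defaultdict


def count_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1
    return following


def predict(following, prev_word, k=3):
    """Return up to k of the most frequent continuations of prev_word."""
    counts = following.get(prev_word.lower(), Counter())
    return [word for word, _ in counts.most_common(k)]


# Tiny stand-in for a corpus; a real one would be gigabytes of text.
sample = "hyvää huomenta kaikille hyvää yötä kaikille hyvää huomenta taas"
model = count_bigrams(sample)
print(predict(model, "hyvää"))
```

A plain word list has no word order, so there is nothing to count here. That is why the corpus matters.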

Also, the layout is the same for Finnish and Swedish. Is it possible to just change the database?
 

The Following 3 Users Say Thank You to rinigus For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#12
Originally Posted by rinigus View Post
Would be great to get Finnish on board. What you need is not a list of words but a large body of Finnish text, called a text corpus ...
Please look into it - would be great to extend the support to Finnish.

Also, the layout is the same for Finnish and Swedish. Is it possible to just change the database?
I could probably help @FlyingAntero achieve this for Finnish, as I did when I created the Swedish resources. As for the last question: yes, basically you could swap out the database, but in the long run it will be easier to do the full package now, so that the language-specific support and language switching work correctly.
 

The Following 3 Users Say Thank You to ljo For This Useful Post:
Posts: 36 | Thanked: 118 times | Joined on Nov 2018
#13
Originally Posted by rinigus View Post
Would be great to get Finnish on board. What you need is not a list of words but a large body of Finnish text, called a text corpus (https://en.wikipedia.org/wiki/Text_corpus). This is because we want to teach the predictor how to "predict", and that can be done if you know the common word sequences in the language. It works for Estonian, so it should work for Finnish too.

You may need to contact a language institute to get such a body of text. For Estonian, I managed to get a large text corpus, about 1900GB of text. A smaller corpus would probably still give a decent result.

Please look into it - would be great to extend the support to Finnish.
OK, now I get it. A text corpus makes sense for prediction. I have access to text corpus data of about 60 GB. Is that enough, or should I look for a bigger one? The data I have found is split across several zip files: two 25 GB zip files and a few smaller ones. Is that a problem?

EDIT: Here is more information about the data. It is in VRT file format:
Originally Posted by ljo View Post
I could probably help @FlyingAntero achieve this for Finnish, as I did when I created the Swedish resources. As for the last question: yes, basically you could swap out the database, but in the long run it will be easier to do the full package now, so that the language-specific support and language switching work correctly.
I would be really grateful for help since my programming skills are very limited.

Last edited by FlyingAntero; 2018-11-29 at 06:15.
 

The Following 2 Users Say Thank You to FlyingAntero For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#14
Originally Posted by FlyingAntero View Post
OK, now I get it. A text corpus makes sense for prediction. I have access to text corpus data of about 60 GB. Is that enough, or should I look for a bigger one? The data I have found is split across several zip files.

EDIT: Here is more information about the data. It is in VRT file format:
I would be really grateful for help since my programming skills are very limited.
That sounds like a good start. Then we can see whether we need more data from our partner Kielipankki. Make sure to include some social media resources in the first batch too. Multiple source files are no problem. Concatenate them if a single source feels easier for you to handle.
No problem, happy to help.
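
For the multiple zip files, one possible approach is to stream them one after another instead of physically concatenating them. A minimal sketch, assuming the archives contain UTF-8 text files; the helper name `iter_corpus_lines` is made up for this example:

```python
import io
import zipfile


def iter_corpus_lines(zip_paths, encoding="utf-8"):
    """Yield text lines from every file inside each zip archive, in order."""
    for path in zip_paths:
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                with archive.open(info) as raw:
                    # errors="replace" keeps one bad byte from aborting a long run.
                    reader = io.TextIOWrapper(raw, encoding=encoding, errors="replace")
                    for line in reader:
                        yield line.rstrip("\n")
```

Since this is a generator, only one line is held in memory at a time, so the total size of the archives does not matter much. It would be called with a list of archive paths, for example `iter_corpus_lines(["part1.zip", "part2.zip"])` (hypothetical file names).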
 

The Following 3 Users Say Thank You to ljo For This Useful Post:
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia
#15
From a brief reading, it looks like VRT is not the text itself but a processed list of tokens. For training, we need either plain text or data already processed into n-grams (the latter was used for Russian). But there should be a text corpus behind these processed files as well.
 

The Following 2 Users Say Thank You to rinigus For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#16
Originally Posted by rinigus View Post
From a brief reading, it looks like VRT is not the text itself but a processed list of tokens. For training, we need either plain text or data already processed into n-grams (the latter was used for Russian). But there should be a text corpus behind these processed files as well.
There is nothing wrong with using the VRT files. They just need to be processed to extract the running text tokens. So if we first do this for what you have, we will see whether further data is needed (probably, since our annotated VRT files carry a lot of annotations that add to the file sizes).
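
A rough sketch of what such an extraction step could look like. It assumes the usual VRT conventions: one token per line, with the word form in the first tab-separated column, and XML-like structural tags on their own lines. The `<sentence>` element name and the sample annotations are assumptions for illustration and should be checked against the actual files; the function name `vrt_to_sentences` is made up.

```python
def vrt_to_sentences(lines):
    """Yield one plain-text sentence per <sentence> element of a VRT stream."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            # Structural tag (assumed: literal "<" in tokens is escaped).
            # A closing sentence tag flushes the tokens collected so far.
            if line.startswith("</sentence") and tokens:
                yield " ".join(tokens)
                tokens = []
            continue
        # Token line: keep only the word form, drop the annotation columns.
        tokens.append(line.split("\t", 1)[0])
    if tokens:
        # Flush trailing tokens if the input ends without a closing tag.
        yield " ".join(tokens)


sample = [
    '<text title="demo">',
    '<sentence id="1">',
    "Hyvää\thyvä\tA",
    "huomenta\thuomenta\tN",
    "!\t!\tPunct",
    "</sentence>",
    '<sentence id="2">',
    "Kiitos\tkiitos\tN",
    "</sentence>",
    "</text>",
]
for sentence in vrt_to_sentences(sample):
    print(sentence)
```

The annotation columns are dropped here, which is also why the extracted text should come out a good deal smaller than the VRT files.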
 

The Following 3 Users Say Thank You to ljo For This Useful Post:
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia
#17
@ljo: sounds great! Good luck with it!
 

The Following 2 Users Say Thank You to rinigus For This Useful Post:
Posts: 36 | Thanked: 118 times | Joined on Nov 2018
#18
I took a closer look at the data that is available from Kielipankki:
  • The Suomi 24 Corpus, ~60 GB: the largest discussion forum in Finland
  • FNC1, ~30 GB: The National Library's journal's Finnish n-grams (1820-2000)
  • DSPCON, ~4 GB: Aalto University DSP Course Conversation Corpus
  • AMPH, 600 MB: Think, ponder, consider -corpus
  • The SFNET corpus, 400 MB: a quite small discussion forum
  • Ylilauta Corpus, 300 MB: a Finnish version of 4chan
  • Opusparcus, 265 MB: Open Subtitles Paraphrase
  • Psycholinguistic corpus, 65 MB: Psycholinguistic Descriptives
  • Morphologies, 50 MB: Morphologies

In addition, there is a corpus data set of several Finnish magazines and newspapers from the 1990s and 2000s (around 300 magazines). However, I downloaded three of them that deal with tech, and each magazine was only ~1 MB. You also have to download each of them separately.

EDIT: Some of them are in VRT format and others in TXT format.

Last edited by FlyingAntero; 2018-11-29 at 10:54.
 

The Following 3 Users Say Thank You to FlyingAntero For This Useful Post:
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia
#19
FNC1 may let you cut corners and get it running without computing any statistics, since that has already been done for you. However, the language may have changed over this time window. Otherwise, the first one is probably of the greatest interest.
 

The Following 2 Users Say Thank You to rinigus For This Useful Post:
Posts: 102 | Thanked: 187 times | Joined on Jan 2010
#20
Originally Posted by rinigus View Post
FNC1 may let you cut corners and get it running without computing any statistics, since that has already been done for you. However, the language may have changed over this time window. Otherwise, the first one is probably of the greatest interest.
Yes, I agree the Suomi-24 corpus is the best to start with.
 

The Following 2 Users Say Thank You to ljo For This Useful Post:
Tags
predictive text, presage, text-prediction