[Announcement]Open source text prediction input plugin - Page 2 - maemo.org

rinigus	2018-11-28 , 16:31
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia	#11

Originally Posted by FlyingAntero

I installed https://openrepos.net/content/sailfi...nput-predictor on my X Compact (using the official patched image from the Xperia X) and it is working like a charm. However, Swedish is my second language. Can anyone help me make a layout for Finnish?

I have found a database of Finnish words in UTF-8 format on GitHub:
https://github.com/hugovk/everyfinnishword

Would be great to get Finnish on board. You will need not a list of words but a large body of Finnish text, called a text corpus (https://en.wikipedia.org/wiki/Text_corpus). This is because we want to teach the predictor how to "predict", and that can be done if you know the common word sequences in the language. It works for Estonian, so it should work for Finnish too.
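To illustrate why a corpus is needed rather than a word list, here is a minimal sketch of sequence-based prediction using bigram counts. It is not the plugin's actual training code; the function names `train_bigrams` and `predict` and the three sample sentences are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers

def predict(followers, prev_word, k=3):
    """Return up to k of the most frequent continuations of prev_word."""
    counts = followers.get(prev_word.lower(), Counter())
    return [word for word, _ in counts.most_common(k)]

# Toy "corpus" (invented sample sentences, not real corpus data).
corpus = [
    "hyvää huomenta kaikille",
    "hyvää huomenta ja hyvää päivää",
    "hyvää yötä",
]
model = train_bigrams(corpus)
print(predict(model, "hyvää"))
```

A plain word list carries no information about which word tends to follow which, so `followers` would stay empty; running text is what fills it.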

You may need to contact some language institute to get such a body of text. For Estonian, I managed to get a large text corpus, about 1900GB of text. But a smaller corpus would probably give a decent result too.

Please look into it - would be great to extend the support to Finnish.

Also, the layout is the same for Finnish and Swedish. Is it possible to just change the database?

ljo	2018-11-28 , 19:35
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#12

Originally Posted by rinigus

Would be great to get Finnish on board. You will need not a list of words but a large body of Finnish text, called a text corpus ...
Please look into it - would be great to extend the support to Finnish.

Also, the layout is the same for Finnish and Swedish. Is it possible to just change the database?

I could probably help @FlyingAntero do this for Finnish, the same way I created the Swedish resources. As for the last question: yes, basically you could swap out the database, but in the long run it will be easier to do the full package now so that the language-specific support and language switching work correctly.

FlyingAntero	2018-11-29 , 06:06
Posts: 36 | Thanked: 118 times | Joined on Nov 2018	#13

Originally Posted by rinigus

Would be great to get Finnish on board. You will need not a list of words but a large body of Finnish text, called a text corpus (https://en.wikipedia.org/wiki/Text_corpus). This is because we want to teach the predictor how to "predict", and that can be done if you know the common word sequences in the language. It works for Estonian, so it should work for Finnish too.

You may need to contact some language institute to get such a body of text. For Estonian, I managed to get a large text corpus, about 1900GB of text. But a smaller corpus would probably give a decent result too.

Please look into it - would be great to extend the support to Finnish.

OK, now I get it. A text corpus makes sense for prediction. I have access to text corpus data of about 60GB. Is that enough, or should I look for a bigger one? The data I found is split across several zip files: two 25GB zip files and a few smaller ones. Is that a problem?

EDIT: Here is more information about the data. It is in VRT file format:

https://www.kielipankki.fi/developme...-input-format/

Originally Posted by ljo

I could probably help @FlyingAntero do this for Finnish, the same way I created the Swedish resources. As for the last question: yes, basically you could swap out the database, but in the long run it will be easier to do the full package now so that the language-specific support and language switching work correctly.

I would be really grateful for help since my programming skills are very limited.

Last edited by FlyingAntero; 2018-11-29 at 06:15.

ljo	2018-11-29 , 08:12
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#14

Originally Posted by FlyingAntero

OK, now I get it. A text corpus makes sense for prediction. I have access to text corpus data of about 60GB. Is that enough, or should I look for a bigger one? The data I found is split across several zip files.

EDIT: Here is more information about the data. It is in VRT file format:
https://www.kielipankki.fi/developme...-input-format/

I would be really grateful for help since my programming skills are very limited.

That sounds like a good start. Then we can see if we need more data from our partner Kielipankki. Make sure to include some social media resources in the first batch too. Multiple source files are no problem. Concatenate them if a single source feels easier for you to handle.
No problem, happy to help.
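As a rough illustration of handling several archives as one source, the sketch below streams lines from every file in every zip, so nothing has to be unpacked or concatenated on disk first. The helper names `iter_lines` and `_demo_archive` are made up here, and the demo archives are built in memory instead of using the real downloads.

```python
import io
import zipfile

def iter_lines(zip_sources, encoding="utf-8"):
    """Yield text lines from every file in every archive, one at a time.

    zip_sources may hold file paths or binary file-like objects.
    """
    for source in zip_sources:
        with zipfile.ZipFile(source) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Decode on the fly so large members are never fully loaded.
                with archive.open(info) as raw:
                    for line in io.TextIOWrapper(raw, encoding=encoding):
                        yield line.rstrip("\n")

def _demo_archive(name, text):
    """Build a small zip in memory (demo only, stands in for a real download)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, text)
    buf.seek(0)
    return buf

archives = [
    _demo_archive("a.txt", "first line\nsecond line\n"),
    _demo_archive("b.txt", "third line\n"),
]
lines = list(iter_lines(archives))
print(lines)
```

With real data, a list of file paths to the downloaded zip files would be passed in place of `archives`.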

rinigus	2018-11-29 , 08:21
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia	#15

From a brief reading, it looks like VRT is not plain text but a processed list of tokens. Training needs either plain text or data already processed into n-grams (the latter was used for Russian). But there should be a text corpus behind these processed files as well.

ljo	2018-11-29 , 09:42
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#16

Originally Posted by rinigus

From a brief reading, it looks like VRT is not plain text but a processed list of tokens. Training needs either plain text or data already processed into n-grams (the latter was used for Russian). But there should be a text corpus behind these processed files as well.

There is nothing wrong with using the VRT files. They just need to be processed to extract the running text tokens. So if we do this first for what you have, we will see whether further data is needed (probably, since the many annotations added to our annotated VRT files inflate the file sizes).
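A minimal sketch of that extraction step, under these assumptions: each token sits on its own line with tab-separated annotations, the word form is the first column, structural lines start with `<`, sentences are wrapped in `<sentence>` tags, and special characters are XML-escaped. The function name `vrt_sentences`, the annotation columns, and the sample data are made up for illustration; the actual files may use other tag names.

```python
import html

def vrt_sentences(lines, sentence_tag="sentence"):
    """Yield one plain-text sentence per sentence element of a VRT stream."""
    words = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            # Structural line (tag or comment): emit the sentence when it closes.
            if line.startswith("</" + sentence_tag) and words:
                yield " ".join(words)
                words = []
            continue
        # Token line: keep only the word form (assumed first column).
        words.append(html.unescape(line.split("\t", 1)[0]))
    if words:
        # Flush tokens left over if the last sentence was never closed.
        yield " ".join(words)

# Invented sample in the assumed layout.
sample = """<!-- #vrt positional-attributes: word lemma pos -->
<text title="demo">
<sentence>
Tämä\ttämä\tPRON
on\tolla\tVERB
testi\ttesti\tNOUN
.\t.\tPUNCT
</sentence>
<sentence>
A&amp;B\tA&amp;B\tNOUN
</sentence>
</text>
"""
for sentence in vrt_sentences(sample.splitlines()):
    print(sentence)
```

Dropping every column after the first is also why the extracted text ends up much smaller than the annotated VRT files.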

rinigus	2018-11-29 , 10:35
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia	#17

@ljo: sounds great! Good luck with it!

FlyingAntero	2018-11-29 , 10:48
Posts: 36 | Thanked: 118 times | Joined on Nov 2018	#18

I took a closer look at the data that is available from Kielipankki:

The Suomi 24 Corpus, ~60GB: the largest discussion forum in Finland
FNC1, ~30GB: Finnish n-grams from the National Library's journals (1820-2000)
DSPCON, ~4GB: Aalto University DSP Course Conversation Corpus
AMPH, 600MB: Think, ponder, consider corpus
The SFNET corpus, 400MB: a fairly small discussion forum
Ylilauta Corpus, 300MB: a Finnish version of 4chan
Opusparcus, 265MB: Open Subtitles Paraphrase
Psycholinguistic corpus, 65MB: Psycholinguistic Descriptives
Morphologies, 50MB: Morphologies

In addition, there is a corpus data set of several Finnish magazines and newspapers from the 1990s and 2000s (around 300 magazines). However, I downloaded three of them that deal with tech, and a single magazine was only ~1MB. You also have to download each of them separately.

EDIT: Some of them are in VRT format and others in TXT format.

Last edited by FlyingAntero; 2018-11-29 at 10:54.

rinigus	2018-11-29 , 15:58
Posts: 1,414 | Thanked: 7,547 times | Joined on Aug 2016 @ Estonia	#19

FNC1 may allow you to cut corners and get it running without computing any stats, since that has already been done for you. However, the language may have changed over this time window. Otherwise, the first one is probably of the greatest interest.
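To show what skipping the stats step could look like, here is a rough sketch that loads precomputed n-gram counts directly. The file layout is purely an assumption (one n-gram per line, words separated by spaces, then a tab and a count); the real FNC1 format has not been checked here and may differ. The function name `load_ngram_counts` and the sample counts are invented.

```python
from collections import Counter

def load_ngram_counts(lines, min_count=1):
    """Read 'w1 w2 ...<TAB>count' lines into a Counter keyed by word tuples."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so the n-gram itself may contain spaces.
        ngram, _, count = line.rpartition("\t")
        if not ngram or not count.isdigit():
            continue  # skip blank or malformed lines
        if int(count) >= min_count:
            counts[tuple(ngram.split())] += int(count)
    return counts

# Invented sample lines in the assumed layout.
sample = [
    "hyvää huomenta\t120",
    "hyvää päivää\t95",
    "hyvää huomenta\t5",
    "broken line",
]
counts = load_ngram_counts(sample)
print(counts.most_common(1))
```

The `min_count` threshold is a hypothetical knob for dropping rare n-grams, which is one way to limit the weight of old spellings if the 1820-2000 time window turns out to matter.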

ljo	2018-11-30 , 10:09
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#20

Originally Posted by rinigus

FNC1 may allow you to cut corners and get it running without computing any stats, since that has already been done for you. However, the language may have changed over this time window. Otherwise, the first one is probably of the greatest interest.

Yes, I agree the Suomi-24 corpus is the best to start with.