maemo.org - Talk (https://talk.maemo.org/index.php)
-   SailfishOS (https://talk.maemo.org/forumdisplay.php?f=52)
-   -   Advanced text entry on Sailfish (Swype or similar) (https://talk.maemo.org/showthread.php?t=92764)

ljo 2016-01-25 11:33

Re: Advanced text entry on Sailfish (Swype or similar)
 
Quote:

Originally Posted by ferlanero (Post 1496239)
Now currently working on Swedish language ;)

Err, why? I already maintain the Swedish language resources and published them over New Year's weekend.

ferlanero 2016-01-25 11:49

Re: Advanced text entry on Sailfish (Swype or similar)
 
Quote:

Originally Posted by ljo (Post 1496251)
Err, why? I already maintain the Swedish language resources and published them over New Year's weekend.

:D Ha, ha! True, I hadn't noticed! Sorry. Focusing on Portuguese now :)

spidernik84 2016-01-25 20:02

Re: Advanced text entry on Sailfish (Swype or similar)
 
Quote:

Originally Posted by ljo (Post 1496196)
@spidernik84 et al, this should rather be between 0.7 and 1.8 million wordforms, but not much more, based on the 92034 stems (roughly what we count as words), which is about the size of a standard working vocabulary of other Latin-script languages like French (0.63 million aspell wordforms). So there is something wrong with the assumptions in the expansion processing.

I think you are right. Another generation attempt just failed (it ran out of 20GB of RAM plus 5GB of swap...).
I compared against English, and this is what I see:

Code:

nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l en dump master | aspell -l en expand | wc
 119789  119789 1153336
nico@hendrix:~/aspell/aspell6-it-2.4-20070901-0$ aspell -l it dump master | aspell -l it expand | wc
  95193 36636439 655315062

The number of words generated for the Italian language is INSANE.
You seem to know a lot about this. Do you have any idea what can be done to keep the dictionary smaller? I've been searching for alternative aspell dictionaries with no luck...

Thanks. I surely hope we don't need to rent a Cray cluster to generate this dict... :)
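
To put the two wc outputs above side by side, here is a minimal Python sketch that turns them into an average number of word forms per line. The figures are copied from the post; the assumption (not confirmed in the thread) is that `aspell expand` prints one base entry per line followed by its expansions, so lines approximate entries.

```python
# Figures copied from the two `wc` outputs above: (lines, words, bytes).
# Assumption: each output line is one base entry followed by its expansions.
wc_output = {
    "en": (119789, 119789, 1153336),
    "it": (95193, 36636439, 655315062),
}


def words_per_line(lines: int, words: int) -> float:
    """Average number of word forms produced per output line."""
    return words / lines


for lang, (lines, words, size_bytes) in wc_output.items():
    ratio = words_per_line(lines, words)
    print(f"{lang}: {ratio:.1f} forms per line, {size_bytes / 1e6:.0f} MB of text")
```

Under that assumption the Italian dictionary averages roughly 385 forms per entry, against exactly 1 for English, which is consistent with the suspicion that the expansion step is the problem rather than the number of stems.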

eber42 2016-01-25 20:06

Re: Advanced text entry on Sailfish (Swype or similar)
 
As discussed with spidernik84, the Italian aspell dictionary contains 34M words (with affix expansion support that was added for Spanish).
Try this:
Code:

aspell -l it dump master | aspell -l it expand | wc -w

In the current process, aspell is used to filter out misspelled words (because the available texts sometimes contain errors).

Even if we fix the corpus reader script, the keyboard has not been built to work with this volume: my largest language (French) contains ~100k words (and only 45k are used by the word prediction engine; the others are in "best effort" mode).

From a quick look I see the following causes for the large size:
  • lots of words are repeated with prefixes such as dall', sull'. At the moment my model handles words joined by apostrophes or hyphens as single words, so words with different prefixes are treated as different words. The OKBoard roadmap contains an item for handling prefixes and suffixes (explicit ones with punctuation marks, or joined together as in German) as distinct words, but I don't know when (or if) I will work on it.
  • some words are repeated multiple times with weird capitalization. Are these different words: Sull'Acclimatatele, sull'Acclimatatele, sull'acclimatatele? At the moment words with different capitalizations are treated as different words (unless they are at the beginning of a sentence), but the case of a word with two different capitalizations is not very well handled.


Spidernik84's text corpus only contains 315k words (only counting those which are also known by aspell), so my short term suggestion is to add an option to provide a (smaller) dictionary instead of using aspell's one or to trust the input corpus to be flawless.

What do you think?

Edit: ouch, spidernik84 was faster with wc :)

ljo 2016-01-25 20:17

Re: Advanced text entry on Sailfish (Swype or similar)
 
Quote:

Originally Posted by spidernik84 (Post 1496332)
The number of words generated for the Italian language is INSANE.
You seem to know a lot about this. Do you have any idea what can be done to keep the dictionary smaller? I've been searching for alternative aspell dictionaries with no luck...

I reduced the size by 3/4 by removing different capitalisations of the same words in the Italian dictionary. It is true that a small fraction might actually be different words, but the majority are just lowercase vs uppercase initial letter differences. Comment out the %-full.dict target in the db/makefile and put the filtered word list directly in your it-full.dict file (trim it further if it is still too large).
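
For anyone wanting to try the same kind of filtering, here is a minimal Python sketch of collapsing capitalisation variants. It is not the filter actually used above (that one is not shown in the thread); the function name and the "prefer the all-lowercase spelling" rule are assumptions made for illustration.

```python
def collapse_case_variants(words):
    """Keep one spelling per case-insensitive group, preferring all-lowercase.

    If no all-lowercase variant exists (e.g. a proper noun), the first
    spelling seen is kept.
    """
    chosen = {}
    for word in words:
        key = word.casefold()
        current = chosen.get(key)
        if current is None or (word.islower() and not current.islower()):
            chosen[key] = word
    return list(chosen.values())


sample = ["Sull'Acclimatatele", "sull'Acclimatatele", "sull'acclimatatele", "Roma"]
filtered = collapse_case_variants(sample)
print(filtered)
```

The output of such a filter could then be written out one word per line, but the exact format expected by it-full.dict should be checked against the OKBoard db scripts.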

spidernik84 2016-01-25 20:24

Re: Advanced text entry on Sailfish (Swype or similar)
 
Quote:

Originally Posted by eber42 (Post 1496334)
From a quick look I see the following causes for the large size:
  • lots of words are repeated with prefixes such as dall', sull'. At the moment my model handles words joined by apostrophes or hyphens as single words, so words with different prefixes are treated as different words. The OKBoard roadmap contains an item for handling prefixes and suffixes (explicit ones with punctuation marks, or joined together as in German) as distinct words, but I don't know when (or if) I will work on it.
  • some words are repeated multiple times with weird capitalization. Are these different words: Sull'Acclimatatele, sull'Acclimatatele, sull'acclimatatele? At the moment words with different capitalizations are treated as different words (unless they are at the beginning of a sentence), but the case of a word with two different capitalizations is not very well handled.

Hello Eber!
I had never heard those words before :)
I can tell you for sure that the forms dall' and sull' are correct, but a bit too formulaic. Also, those are "articulated prepositions" placed in front of nouns, hence they should be considered on their own. Example:

dall'anima
dall'oceano

The nouns are "anima" and "oceano", while "dall'" is the preposition. That does not justify creating a word for each preposition+word combination!
There are additional rules, naturally: for instance, that form is only used with words starting with a vowel...
Good that you are thinking of handling this situation.
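
To illustrate why treating the preposition and the noun as separate units shrinks the vocabulary, here is a minimal Python sketch. It is not OKBoard code; the regular expression, the function names and the sample words are assumptions, and only the plain ASCII apostrophe is handled.

```python
import re

# One or more letters, an ASCII apostrophe, then the rest of the word.
ELISION = re.compile(r"^([^\W\d_]+')(.+)$")


def split_elision(token):
    """Split "dall'anima" into ["dall'", "anima"]; leave other tokens whole."""
    match = ELISION.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


def unique_units(tokens):
    """Set of distinct units left after splitting every token."""
    units = set()
    for token in tokens:
        units.update(split_elision(token))
    return units


prefixes = ["dall'", "sull'", "nell'"]
nouns = ["anima", "oceano", "amaca"]
combined = [prefix + noun for prefix in prefixes for noun in nouns]

units = unique_units(combined)
print(len(combined), "combined entries ->", len(units), "separate units")
```

With P prepositions and N nouns the combined list grows as P × N, while the split list grows as P + N, which is the effect described above.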

As for the capitalization: I would not consider it common for a word to have both capitalised and lowercase variants. Most words are either capitalised or not, so I'd prioritise the lowercase form when multiple variants are found.

Quote:


Spidernik84's text corpus only contains 315k words (only counting those which are also known by aspell), so my short term suggestion is to add an option to provide a (smaller) dictionary instead of using aspell's one or to trust the input corpus to be flawless.

What do you think?

Edit: ouch, spidernik84 was faster with wc :)

We can try skipping aspell just for my language, for sure... I'm afraid of the results though: spelling mistakes are definitely common :D
It's worth a shot, I'll see what happens. Thanks for your help.

ljo 2016-01-25 20:31

Re: Advanced text entry on Sailfish (Swype or similar)
 
Quote:

Originally Posted by eber42 (Post 1496334)
1) But the case of words with two different capitalizations is not very well handled.
...
2) so my short term suggestion is to add an option to provide a (smaller) dictionary instead of using aspell's one or to trust the input corpus to be flawless.

What do you think ?

1) That is definitely true. I saw this with the Spanish dictionary too when I did the full corpus.

2) Yes, providing an alternative dictionary is good. Maybe just keep the dict if it's there? Instruct to do a clean build otherwise? Assuming the corpus is flawless is not too bad either, since people still write a lot of stuff that is not covered by the aspell dictionary.
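
The two options discussed here (use a supplied dictionary if one exists, otherwise trust the corpus) could be sketched in Python as below. This is only an illustration, not the actual OKBoard build logic: the function names are hypothetical, and a one-word-per-line file format is assumed for it-full.dict.

```python
import os
import tempfile


def load_wordlist(path):
    """Return the set of words in `path` (one per line), or None if absent."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_corpus(tokens, wordlist):
    """Keep only listed tokens; with no word list, trust the corpus as-is."""
    if wordlist is None:
        return list(tokens)
    return [token for token in tokens if token in wordlist]


tokens = ["il", "gatto", "gattto", "amaca"]  # "gattto" is a deliberate typo

with tempfile.TemporaryDirectory() as workdir:
    dict_path = os.path.join(workdir, "it-full.dict")
    # No dictionary file yet: every corpus token is kept.
    print(filter_corpus(tokens, load_wordlist(dict_path)))
    with open(dict_path, "w", encoding="utf-8") as handle:
        handle.write("il\ngatto\namaca\n")
    # Dictionary present: tokens missing from it are dropped.
    print(filter_corpus(tokens, load_wordlist(dict_path)))
```

The trade-off is the one raised in the thread: without a word list, typos in the corpus end up in the keyboard's vocabulary; with one, valid words missing from the list are lost.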

itdoesntmatt 2016-01-25 20:50

Re: Advanced text entry on Sailfish (Swype or similar)
 
Sorry guys, I don't know how many of you are Italian, but I am.
Dall', sull' and similar forms could just be inserted as single words.
When you write sentences you actually leave a space between the preposition and the following word,
so I think it would be better to have two separate words:
dall/dall' (showing both options when d-a-l-l is swiped) and anima, for example.

However, sull' Acclimatatele, for example, doesn't make sense.
Sull' is a preposition that precedes a noun and means over/on/regarding. For example, Sull' Oceano literally means "over the ocean".
The ' is inserted just because Oceano starts with a vowel!

Besides, acclimatatele is a very unusual word in common speech. "Acclimatare" means "to get used to some climate condition" (for example, when you are out in the cold winter and come into your home, you spend your first minutes just "getting used" to the warmer condition).

"Acclimatate" is one of the possible conjugations (the past participle, participio passato) of this verb, used when referring to feminine plural nouns (a group of women, for example).

"AcclimatateLE" literally means "make them acclimatized".

So I mean, those are words not very frequently used in speech.
Sorry for my bad teaching skills.

spidernik84 2016-01-25 21:01

Re: Advanced text entry on Sailfish (Swype or similar)
 
Quote:

Originally Posted by itdoesntmatt (Post 1496345)
Sorry guys, I don't know how many of you are Italian, but I am.
Dall', sull' and similar forms could just be inserted as single words.
When you write sentences you actually leave a space between the preposition and the following word.

Ciao!
I am pretty confident there should be no space between an article and its noun, or between an articulated preposition and its noun. This is the only input I can give :)

itdoesntmatt 2016-01-25 21:08

Re: Advanced text entry on Sailfish (Swype or similar)
 
Hello to you too, guys :) and many thanks for your effort!


I know, but I explained myself badly.
When you write a sentence:
example: il gatto e' sull'Amaca
I swipe it this way: .. I-L..G-A-T-T-O... E'.. S-U-L-L(') ..A-M-A-C-A

It is not comfortable to swipe S-U-L-L-'-A-M-A-C-A,
because we consider them separate words when we think about them. Sull'Amaca is treated just like Sul Letto, as two separate words, even if formally you shouldn't leave the space.
Moreover, in everyday written language (SMS, chat and so on) it makes no real difference if you leave a space between the preposition with ' and the following word.
I don't know how to explain it better; I hope it is understandable.
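
The idea of swiping S-U-L-L and still getting sull' offered can be sketched in Python as a lookup keyed on letters only. This is a toy illustration, not how OKBoard matches swipe paths: the function names and the small vocabulary are assumptions.

```python
from collections import defaultdict


def swipe_key(word):
    """Letters only, lowercased: roughly what a path over letter keys can express."""
    return "".join(ch for ch in word.casefold() if ch.isalpha())


def build_index(vocabulary):
    """Map each letters-only key to every vocabulary entry sharing it."""
    index = defaultdict(list)
    for word in vocabulary:
        index[swipe_key(word)].append(word)
    return index


vocabulary = ["sul", "sull'", "dall'", "amaca", "gatto"]
index = build_index(vocabulary)

# Swiping s-u-l-l finds the elided preposition, apostrophe included.
print(index[swipe_key("sull")])
# The noun is then swiped as its own word.
print(index[swipe_key("amaca")])
```

Whether the keyboard should then join the two words without a space (sull'amaca) is a separate output-formatting decision.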


All times are GMT.