Advanced text entry on Sailfish (Swype or similar) - Page 18 - maemo.org

cvp	2016-01-07 , 20:13
Posts: 738 | Thanked: 819 times | Joined on Jan 2012 @ Berlin	#171

Thank you for the how-to. I have one problem:

~/okb-engine/db $ sh build.sh de
build.sh: 5: set: Illegal option -o pipefail

PS: If I check this:
echo $VARIABLE_NAME

I get no output.

Last edited by cvp; 2016-01-07 at 20:19.

ljo	2016-01-07 , 20:58
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#172

@ferlanero, it is better to direct people to http://git.tuxfamily.org/okboard/okb...tree/README.md directly. As I said before, you cannot use a dictionary as a corpus. It does not work, because the corpus must be running text with actual sentences, and a lot of them, as explained in the README.md.

ljo	2016-01-07 , 21:16
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#173

Originally Posted by cvp

~/okb-engine/db $ sh build.sh de
build.sh: 5: set: Illegal option -o pipefail

Then your shell does not support this option. There are some workarounds if your shell is dash, but if you have bash installed use that explicitly.
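
A minimal sketch of how to check this before running the build. `supports_pipefail` is a hypothetical helper name, not part of okb-engine; it only asks a given shell whether it accepts the option that `build.sh` sets.

```shell
# Hypothetical helper: succeeds if the shell named in $1 accepts
# "set -o pipefail", fails otherwise (including when it is not installed).
supports_pipefail() {
    "$1" -c 'set -o pipefail' >/dev/null 2>&1
}

# Report which interpreter can be used to run build.sh.
for candidate in sh bash; do
    if supports_pipefail "$candidate"; then
        echo "$candidate: ok, e.g. '$candidate build.sh de'"
    else
        echo "$candidate: no pipefail support (or not installed)"
    fi
done
```

If `bash` passes the check, running `bash build.sh de` instead of `sh build.sh de` uses it explicitly.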

Originally Posted by cvp

PS: If I check this:
echo $VARIABLE_NAME

I get no output.

Then the variable is not set. By the way, VARIABLE_NAME is a placeholder: it should be one of the two variables CORPUS_DIR or WORK_DIR. If they are not set, the script will tell you.
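
A small sketch for checking both variables in the current shell. `show_var` is a hypothetical helper written for this post, not something the build script provides.

```shell
# Hypothetical helper: print NAME=value if the variable named in $1 is
# set and non-empty, otherwise print a notice and return non-zero.
show_var() {
    eval "_value=\${$1:-}"
    if [ -n "$_value" ]; then
        echo "$1=$_value"
    else
        echo "$1 is not set"
        return 1
    fi
}

# The two variables named above; "|| true" keeps going if one is missing.
show_var CORPUS_DIR || true
show_var WORK_DIR || true
```

Note that, in general, a variable assigned without `export` is not passed on to a child process such as a script started from that shell.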

ferlanero	2016-01-07 , 22:41
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain	#174

Originally Posted by ljo

@ferlanero, it is better to direct people to http://git.tuxfamily.org/okboard/okb...tree/README.md directly. As I said before, you cannot use a dictionary as a corpus. It does not work, because the corpus must be running text with actual sentences, and a lot of them, as explained in the README.md.

Yes, I know, that would be the perfect choice, but http://git.tuxfamily.org/okboard/okb...tree/README.md is hard for ordinary people to understand, and I would rather open a discussion here so that we have a place to ask questions about each step. For example, I don't know where to look for the sentences you speak about, so maybe someone could give me some clue about it...

If you only point people to http://git.tuxfamily.org/okboard/okb...tree/README.md and they don't understand it, we will be stuck at the same point forever.

So perhaps together we could create a "how to" that anyone can understand. I think this is the right way to get OKBoard supporting many more languages than the 3 it has been stuck at for 3 weeks. Do you see what I mean?

ssahla	2016-01-07 , 22:52
Posts: 89 | Thanked: 243 times | Joined on Jun 2014	#175

Originally Posted by ferlanero

For example, I don't know where to look for the sentences you speak about, so maybe someone could give me some clue about it...

Quoting a message one page ago (http://talk.maemo.org/showpost.php?p...&postcount=158), emphasis mine:

To train the model you need to feed it with a huge volume of text. The text should be representative of the kind of text you will type.
For example, if you use a Wikipedia corpus, the keyboard will be very uncooperative if you try to type informal text that would look unnatural in a Wikipedia article.

Building language files is not just a matter of pouring random text into the build tool, or you will end up with a high error rate.
I recommend using a lot of text (my French corpus is over 40 million words, and in some cases this is not enough), and using different kinds of documents: articles (news / Wikipedia), e-mail, IRC and chat logs ...
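
To get a rough idea of how a collection of text files compares with the 40 million word figure above, a word count is enough. This is only a sketch: `count_words` and the file names in the usage comment are made up for illustration.

```shell
# Hypothetical helper: total number of whitespace-separated words in
# the given plain-text files.
count_words() {
    cat -- "$@" | wc -w | tr -d ' '
}

# Usage (file names are placeholders):
# count_words wikipedia.txt irc-logs.txt mail.txt
```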

ferlanero	2016-01-07 , 23:04
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain	#176

I have already read the entire thread, and I have read what kind of text we need to look for, but where do we find it? I don't think the author put together more than 40 million words by hand.

ferlanero	2016-01-07 , 23:12
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain	#177

And another error.

If I do what the author writes in http://git.tuxfamily.org/okboard/okb...tree/README.md:

### How to distribute language files
You have several options for distributing language files (`$LANG.tre`, `predict-$LANG.ng`, `predict-$LANG.db`):
* Just copy them to any Jolla device in `~/.local/share/okboard/`. When you switch language on the keyboard, new files will be available. No need to restart the keyboard.

When I activate OKBoard, the program automatically deletes the files (`$LANG.tre`, `predict-$LANG.ng`, `predict-$LANG.db`), in my case (`es.tre`, `predict-es.ng`, `predict-es.db`), from `~/.local/share/okboard/`.

So maybe the README.md needs more support...

Last edited by ferlanero; 2016-01-07 at 23:18.

ljo	2016-01-08 , 00:36
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#178

Originally Posted by ferlanero

And another error.

When I activate OKBoard, the program automatically deletes the files (`$LANG.tre`, `predict-$LANG.ng`, `predict-$LANG.db`), in my case (`es.tre`, `predict-es.ng`, `predict-es.db`), from `~/.local/share/okboard/`.

So maybe the README.md needs more support...

Yes, that instruction should be removed. But for the language resource creation steps above that, I already submitted a patch that makes the process smoother to follow. Recreating major parts of the documentation in single entries here is not the way to do it.
We have now found your main problem. The question could be formulated like this: "How do I create a text corpus of my language of choice?"
Answer: You collect texts in Spanish according to any of the tutorials on the subject.
Alternative answer: You download texts of the types you highlighted: Wikipedia dumps, blog dumps, IRC log dumps, SMS conversation dumps, etc. You then paste all the texts together, remove all non-ASCII/non-Latin-1 characters (e.g. with iconv, python, perl or another command-line tool) and follow the processing instructions in the README.md file.
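
The filtering step could be sketched with iconv as follows. This assumes the downloaded texts are UTF-8 and that a UTF-8 result is wanted; `latin1_only` and the input file names are hypothetical, and the README.md remains the reference for what the build expects.

```shell
# Hypothetical helper: read UTF-8 text on stdin and write UTF-8 text on
# stdout with every character outside Latin-1 (ISO-8859-1) dropped.
latin1_only() {
    # -c omits characters that cannot be converted; some iconv builds
    # then exit non-zero, hence "|| true".
    { iconv -c -f UTF-8 -t ISO-8859-1 || true; } |
        iconv -f ISO-8859-1 -t UTF-8
}

# Usage (file names are placeholders): paste several downloaded text
# files together and filter them into one corpus file.
# cat wikipedia.txt blogs.txt irc.txt | latin1_only > corpus-es.txt
```

Converting to Latin-1 and back is just one way to do the filtering; python or perl would work equally well, as mentioned above.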

If you find further problems, formulate a minimal example of where you get stuck. Don't speculate, just put it in the form of a simple question here. There might be a term for what you want to achieve. Hold off on writing a new tutorial unless you find a totally unexplored area where a simple web search gives you nothing to link to.

ferlanero	2016-01-08 , 02:20
Posts: 105 | Thanked: 205 times | Joined on Dec 2015 @ Spain	#179

OK, thanks for the info. Now I understand how OKBoard works with the corpus files. But my question remains: why does OKBoard delete the `es.tre`, `predict-es.ng`, `predict-es.db` files that I copy into `~/.local/share/okboard/` on Sailfish OS to test whether they work, when the build process "db/build.sh es" doesn't give any errors?

Here is my log:

Code:

[ferlanero@ferlanero-imac okb-engine-master]$ db/build.sh es
Building for languages:  es
~/okb-engine-master/ngrams ~/okb-engine-master/db
running build
running build_ext
running build
running build_ext
~/okb-engine-master/db
~/okb-engine-master/cluster ~/okb-engine-master/db
make: Nothing to be done for 'first'.
~/okb-engine-master/db
«/home/ferlanero/okb-engine-master/db/lang-en.cf» -> «/home/ferlanero/okboard/langs/lang-en.cf»
«/home/ferlanero/okb-engine-master/db/lang-es.cf» -> «/home/ferlanero/okboard/langs/lang-es.cf»
«/home/ferlanero/okb-engine-master/db/lang-fr.cf» -> «/home/ferlanero/okboard/langs/lang-fr.cf»
«/home/ferlanero/okb-engine-master/db/lang-nl.cf» -> «/home/ferlanero/okboard/langs/lang-nl.cf»
«/home/ferlanero/okb-engine-master/db/add-words-fr.txt» -> «/home/ferlanero/okboard/langs/add-words-fr.txt»
«/home/ferlanero/okb-engine-master/db/db.version» -> «/home/ferlanero/okboard/langs/db.version»
make: '.depend-es' is up to date.
( [ -f "add-words-es.txt" ] && cat "add-words-es.txt" ; aspell -l es dump master ) | sort | uniq > es-full.dict
lbzip2 -d < /home/ferlanero/okboard/langs/corpus-es.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/corpus-splitter.pl 200 50 es-learn.tmp.bz2 es-test.tmp.bz2
mv -vf es-learn.tmp.bz2 es-learn.txt.bz2
«es-learn.tmp.bz2» -> «es-learn.txt.bz2»
mv -vf es-test.tmp.bz2 es-test.txt.bz2
«es-test.tmp.bz2» -> «es-test.txt.bz2»
set -o pipefail ; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-full.dict | sort -rn | lbzip2 -9 > grams-es-full.csv.bz2.tmp
mv -f grams-es-full.csv.bz2.tmp grams-es-full.csv.bz2
set -o pipefail ; lbzip2 -d < grams-es-full.csv.bz2 | grep ';#NA;#NA;' | cut -f '1,4' -d';' \
 | grep -v '#TOTAL' | sort -rn | cut -d';' -f 2 | egrep -v '^(i)$' | tee words-es.txt \
         | sed -n "1,30000 p" > es-predict.dict.tmp  # ok i've re-implemented "head" with sed to avoid ugly sigpipes (which hurt with -o pipefail)
mv -f es-predict.dict.tmp es-predict.dict
set -o pipefail	; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-predict.dict | lbzip2 -9 > grams-es-learn.csv.bz2.tmp
/home/ferlanero/okb-engine-master/db/../tools/loadkb.py es-full.tre < es-full.dict
set -o pipefail ; lbzip2 -d < es-test.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-predict.dict | lbzip2 -9 > grams-es-test.csv.bz2.tmp
mv -f grams-es-learn.csv.bz2.tmp grams-es-learn.csv.bz2
Computing clusters for language es. Please make some coffee ...
 (logs can be found in clusters-es.log)
set -o pipefail ; lbzip2 -d < grams-es-learn.csv.bz2 | sort -rn | sed -n "1,13500000 p" \
 | /home/ferlanero/okb-engine-master/db/../tools/cluster -n 10 -o clusters-es.tmp > clusters-es.log 2>&1
mv -f clusters-es.tmp clusters-es.txt
mv -f grams-es-test.csv.bz2.tmp grams-es-test.csv.bz2
1000
set -o pipefail ; lbzip2 -d < grams-es-learn.csv.bz2 \
 | /home/ferlanero/okb-engine-master/db/../tools/clusterize.py -l 8 -w 200000 -c 500000 clusters-es.txt \
 | tee predict-es.txt \
 | /home/ferlanero/okb-engine-master/db/../tools/load_cdb_fslm.py predict-es-tmp.db
Import CSV corpus data ...
Dumping compressed ngram file ...
Dumping words to database ...
2000
lbzip2 -9fv predict-es.txt
lbzip2: compressing "predict-es.txt" to "predict-es.txt.bz2"
lbzip2: "predict-es.txt": compression ratio is 1:2.274, space savings is 56.02%
/home/ferlanero/okb-engine-master/db/../tools/db_param.py predict-es-tmp.db version 11
lbzip2 -9f predict-es-tmp.rpt
mv -f predict-es-tmp.db predict-es.db
mv -f predict-es-tmp.ng predict-es.ng
mv -f predict-es-tmp.rpt.bz2 predict-es.rpt.bz2
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
/home/ferlanero/okb-engine-master/db/../tools/loadkb.py es.tre < words-es.txt  # all word seens in learn corpus (smaller than full directory, but bigger than prediction learning dictionary)
OK es
sending incremental file list
es-full.tre
es.tre
predict-es.db
predict-es.ng
predict-es.rpt.bz2

sent 2,423,995 bytes  received 111 bytes  4,848,212.00 bytes/sec
total size is 2,423,052  speedup is 1.00

ljo	2016-01-08 , 07:17
Posts: 102 | Thanked: 187 times | Joined on Jan 2010	#180

Originally Posted by ferlanero

OK, thanks for the info. Now I understand how OKBoard works with the corpus files. But my question remains: why does OKBoard delete the `es.tre`, `predict-es.ng`, `predict-es.db` files that I copy into `~/.local/share/okboard/` on Sailfish OS to test whether they work, when the build process "db/build.sh es" doesn't give any errors?

Here is my log:

These two are completely independent things: creating the language resources is one, and running the engine and keyboard is the other, so there is no need for the log. There simply seems to be an asymmetry in the runtime setup between the default language resources (en, fr, nl) and the others. The latter are completely restored each time, and this amounts to removal if there are no files to restore for that language.
The behavior should probably be symmetrical, but you can argue for both behaviors. The language resource developer can easily take the extra step of putting the resources in /usr/share/okboard/, while the end user will probably want any newly installed resources to be used rather than the old ones in ~/.local/share/okboard/. A simple test would probably be good: if no resources for the language are installed in /usr/share/okboard/, just keep the local ones. It would add complexity, though.
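
To make the suggested test concrete, here is a hypothetical sketch. It is not OKBoard's actual code: `keep_local_resources` is an invented name, and the file names are the `$LANG.tre`, `predict-$LANG.db` and `predict-$LANG.ng` patterns quoted from the README earlier in the thread.

```shell
# Hypothetical sketch, NOT OKBoard's actual code: decide whether the
# local files for language $1 should be left alone.
# $2 is the system-wide resource directory (e.g. /usr/share/okboard).
keep_local_resources() {
    lang=$1
    system_dir=$2
    # Keep the local files when the system directory ships nothing
    # for this language, so there is nothing to restore from.
    [ ! -e "$system_dir/$lang.tre" ] &&
        [ ! -e "$system_dir/predict-$lang.db" ] &&
        [ ! -e "$system_dir/predict-$lang.ng" ]
}

# Usage:
# keep_local_resources es /usr/share/okboard && echo "keep local es files"
```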