    Advanced text entry on Sailfish (Swype or similar)

    ferlanero | # 181 | 2016-01-08, 12:26 | Report

    Thank you again. I've now figured out how these files work with OKBoard. Now I have another two questions:

    1 - When I add corpora files, do I have to copy all the sentences and words into one file and then compress it to bz2, or can I take multiple files, compress them into one bz2, and then run db/build.sh?

    2 - Which tools do you use to fix the format of those dump files? Is there an easy way to do it? Normally I have to remove the first columns of the files, but other times I have to remove the last characters of those files... so I don't know how to do it properly.

    3 - And related to this: in the correct dump file format for OKBoard, must the sentences end with full stops, must there be blank lines between them, or is neither necessary?

    Thanks in advance, folks!

    ljo | # 182 | 2016-01-08, 12:43 | Report

    Originally Posted by ferlanero View Post
    Thank you again. I've now figured out how these files work with OKBoard. Now I have another two questions:

    1 - When I add corpora files, do I have to copy all the sentences and words into one file and then compress it to bz2, or can I take multiple files, compress them into one bz2, and then run db/build.sh?

    2 - Which tools do you use to fix the format of those dump files? Is there an easy way to do it? Normally I have to remove the first columns of the files, but other times I have to remove the last characters of those files... so I don't know how to do it properly.

    3 - And related to this: in the correct dump file format for OKBoard, must the sentences end with full stops, must there be blank lines between them, or is neither necessary?

    Thanks in advance, folks!
    1 - One file. Just use cat to concatenate them all together:
    cat file1 file2 file3 file4 file5 > corpus-es.txt
    2 - iconv, sed, perl, python.
    3 - You need either sentence-ending punctuation or empty rows in between, not both.
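
    Put together, the whole step could look like this (a sketch; the target path ~/okboard/langs/corpus-es.txt.bz2 is taken from the build log later in this thread, so adjust it to your setup):

    Code:
    # Merge all corpus files into one, compress, then build the Spanish model.
    cat file1 file2 file3 file4 file5 > corpus-es.txt
    lbzip2 -9 < corpus-es.txt > ~/okboard/langs/corpus-es.txt.bz2
    db/build.sh es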

    ferlanero | # 183 | 2016-01-12, 14:17 | Report

    Hi again guys!

    After finding all the necessary corpus databases and adjusting them to OKBoard's requirements, the process gives me this error, which I don't know how to solve. Any ideas, please?

    Code:
    Traceback (most recent call last):
      File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 210, in <module>
        line = sys.stdin.readline()
      File "/usr/lib/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
    /home/ferlanero/okb-engine-master/db/makefile:43: recipe for target 'grams-es-full.csv.bz2' failed
    make: *** [grams-es-full.csv.bz2] Error 1
    How can I find the character that causes the error in the original corpus-es file? I mean, does this message point to somewhere in that file?

    Code:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte

    ferlanero | # 184 | 2016-01-12, 14:18 | Report

    Maybe the full log helps in some way:

    Code:
    [ferlanero@ferlanero-imac okb-engine-master]$ db/build.sh es
    Building for languages:  es
    ~/okb-engine-master/ngrams ~/okb-engine-master/db
    running build
    running build_ext
    running build
    running build_ext
    ~/okb-engine-master/db
    ~/okb-engine-master/cluster ~/okb-engine-master/db
    make: Nothing to be done for 'first'.
    ~/okb-engine-master/db
    «/home/ferlanero/okb-engine-master/db/lang-en.cf» -> «/home/ferlanero/okboard/langs/lang-en.cf»
    «/home/ferlanero/okb-engine-master/db/lang-es.cf» -> «/home/ferlanero/okboard/langs/lang-es.cf»
    «/home/ferlanero/okb-engine-master/db/lang-fr.cf» -> «/home/ferlanero/okboard/langs/lang-fr.cf»
    «/home/ferlanero/okb-engine-master/db/lang-nl.cf» -> «/home/ferlanero/okboard/langs/lang-nl.cf»
    «/home/ferlanero/okb-engine-master/db/add-words-fr.txt» -> «/home/ferlanero/okboard/langs/add-words-fr.txt»
    «/home/ferlanero/okb-engine-master/db/db.version» -> «/home/ferlanero/okboard/langs/db.version»
    make: '.depend-es' is up to date.
    ( [ -f "add-words-es.txt" ] && cat "add-words-es.txt" ; aspell -l es dump master ) | sort | uniq > es-full.dict
    lbzip2 -d < /home/ferlanero/okboard/langs/corpus-es.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/corpus-splitter.pl 200 50 es-learn.tmp.bz2 es-test.tmp.bz2
    mv -vf es-learn.tmp.bz2 es-learn.txt.bz2
    «es-learn.tmp.bz2» -> «es-learn.txt.bz2»
    mv -vf es-test.tmp.bz2 es-test.txt.bz2
    «es-test.tmp.bz2» -> «es-test.txt.bz2»
    set -o pipefail ; lbzip2 -d < es-learn.txt.bz2 | /home/ferlanero/okb-engine-master/db/../tools/import_corpus.py es-full.dict | sort -rn | lbzip2 -9 > grams-es-full.csv.bz2.tmp
    [5] 597/56329 words, 3232 n-grams, read 1 MB
    [10] 739/56329 words, 5115 n-grams, read 3 MB
    [15] 821/56329 words, 6611 n-grams, read 4 MB
    [20] 880/56329 words, 7950 n-grams, read 6 MB
    [25] 938/56329 words, 9167 n-grams, read 8 MB
    [... roughly 150 similar progress lines elided: the unique-word count climbs only from 988 to 1720 while another ~250 MB is read ...]
    [770] 1722/56329 words, 70927 n-grams, read 256 MB
    [775] 1724/56329 words, 71187 n-grams, read 258 MB
    Traceback (most recent call last):
      File "/home/ferlanero/okb-engine-master/db/../tools/import_corpus.py", line 210, in <module>
        line = sys.stdin.readline()
      File "/usr/lib/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
    /home/ferlanero/okb-engine-master/db/makefile:43: recipe for target 'grams-es-full.csv.bz2' failed
    make: *** [grams-es-full.csv.bz2] Error 1
    [ferlanero@ferlanero-imac okb-engine-master]$

    ljo | # 185 | 2016-01-12, 22:25 | Report

    Originally Posted by ferlanero View Post
    Hi again guys!

    How can I find the character that causes the error in the original corpus-es file? I mean, does this message point to somewhere in that file?

    Code:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte
    @ferlanero, it says the first byte of a multibyte character is wrong, so probably some faulty UTF-8 encoding.
    You can either search for the named byte by its hexadecimal value directly, or convert it to some form your tool of choice (perl, python, grep) understands. NB the "position 35" in the traceback is an offset inside the decoder's current buffer, not in the whole file.
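
    For example (a sketch, not tested on your file; byte 0x96 is an en dash in Windows-1252, which suggests some of your sources are cp1252-encoded rather than UTF-8):

    Code:
    # Print the lines that contain the raw byte 0x96 (GNU grep; -a forces text mode):
    LC_ALL=C grep -anP '\x96' corpus-es.txt

    # Or report the offset of the first invalid byte; decoding the whole file at
    # once makes the UnicodeDecodeError position an absolute offset in the file:
    python3 -c "open('corpus-es.txt','rb').read().decode('utf-8')"

    # If the file really is cp1252, a blanket re-encode fixes all such bytes:
    iconv -f WINDOWS-1252 -t UTF-8 corpus-es.txt > corpus-es-utf8.txt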

    Or PM me proper URLs to download your selected resources (this goes for all of you who have made the effort to collect enough useful resources to build the required model data for your language), and I will maintain the language package for the community. I will, though, reject any request that does not contain at least 40 million running words from several types of source material, as discussed earlier in this thread.

    ferlanero | # 186 | 2016-01-13, 11:18 | Report

    Originally Posted by ljo View Post
    @ferlanero, it says the first byte of a multibyte character is wrong, so probably some faulty UTF-8 encoding.
    You can either search for the named byte by its hexadecimal value directly, or convert it to some form your tool of choice (perl, python, grep) understands.
    Great! I found it! Now I can continue with the process. Thanks, ljo.

    About the data I have gathered: yes, I have tons of it, from many different sources. So I decided to process all of it, and when I'm done I would like your support to make a proper RPM that adds Spanish language support to OKBoard. Would that be possible?

    I have about 3 GB of corpus data, so it may take me a while to process everything...

    ljo | # 187 | 2016-01-13, 14:57 | Report

    Originally Posted by ferlanero View Post
    Great! I found it! Now I can continue with the process. Thanks, ljo.
    Great!
    NB: Don't forget to use the Thanks! link in the replies, since that is the formal way to send thanks in this forum.

    Originally Posted by ferlanero View Post
    About the data I have gathered: yes, I have tons of it, from many different sources. So I decided to process all of it
    Good, let's hope the mix is OK then, so that it makes good predictions.

    Originally Posted by ferlanero View Post
    and when I'm done I would like your support to make a proper RPM that adds Spanish language support to OKBoard. Would that be possible?
    Yes, that is possible.

    Originally Posted by ferlanero View Post
    I have about 3 GB of corpus data, so it may take me a while to process everything...
    Sounds like you will end up with a good language resource for Spanish.

    eber42 | # 188 | 2016-01-13, 19:22 | Report

    @ferlanero, according to your logs, the first 258 MB of your text corpus use only about 1,700 unique words (which is roughly the vocabulary of a two-year-old child). The n-gram count is incredibly low as well.

    The corpus URLs you quoted should be all right, so maybe there was an issue when you converted them.
    Could you share your input file (or at least a sample)?

    For everybody: I recommend using the Opensubtitles.org corpus collected by OPUS. It should be more relevant for casual chat and can be used together with more formal text samples (news, books, Wikipedia, ...).
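
    A quick way to sanity-check a corpus before building is to count its distinct tokens (a rough sketch; tr works byte-wise, so UTF-8 accented letters get split, but it is fine for an order-of-magnitude check):

    Code:
    # Rough unique-word count for a compressed corpus:
    lbzip2 -d < corpus-es.txt.bz2 | tr -cs '[:alpha:]' '\n' | sort -u | wc -l

    For a few hundred MB of general-purpose Spanish text, this should come out orders of magnitude above 1,700.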

    ljo | # 189 | 2016-01-13, 22:57 | Report

    @eber42, yes, the OPUS corpus is compiled by a former colleague of mine. It is good for a lot of things.
    Let us see what @ferlanero comes up with.
    Could the ones who waved their hands for producing resources for German please tell us their status? Otherwise I will do one on February 1st.

    ferlanero | # 190 | 2016-01-14, 00:21 | Report

    Originally Posted by eber42 View Post
    @ferlanero, according to your logs, the first 258 MB of your text corpus use only about 1,700 unique words (which is roughly the vocabulary of a two-year-old child). The n-gram count is incredibly low as well.
    The file in that log is over 1 GB and is the result of merging the news, Wikipedia and news-crawl corpora from 2006 to 2011 from Uni-Leipzig. Since adding the files from the other sources increased the final size enormously, I decided to split the data into several files and check them separately before processing everything together. As I'm having several problems with the UTF-8 encoding, I have to check each corpus on its own. For example, the one that contains colloquial speech I have already checked, and it is ready to process. The same goes for the academic corpus and a dictionary. But as I said, my main trouble is the UTF-8/ASCII encoding errors.
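
    One way to do that per-file check (a sketch; GNU iconv exits non-zero at the first invalid sequence, so clean files pass silently):

    Code:
    # Flag every corpus file that is not valid UTF-8 before merging:
    for f in *.txt; do
        iconv -f UTF-8 -t UTF-8 "$f" -o /dev/null || echo "$f: invalid UTF-8"
    done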

    Originally Posted by eber42 View Post
    The corpus URL you quoted should be all right, so maybe there was a issue when you converted them.
    Could you share your input file (or at least a sample) ?
    Sure, I have no problem sharing my files so you can check whether they are valid and correct any errors they have. As I said, I have enough processing power to help add support for more languages to OKBoard, so I'd welcome your guidelines for doing this correctly.

    I'll send you a PM with the input file.

    Thanks for your support and your work on OKBoard!
