|
|
2018-12-06, 06:42
|
|
Posts: 36 |
Thanked: 118 times |
Joined on Nov 2018
|
#22
|
| The Following User Says Thank You to FlyingAntero For This Useful Post: | ||
|
|
2018-12-06, 15:49
|
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#23
|
I don't know how it will work out. I have downloaded the files and uploaded them to the drive (65 GB). I can share a link if someone wants to try it out. If not, I might try the Finnish n-grams from the National Library's journals myself, because that is easier.
| The Following 2 Users Say Thank You to ljo For This Useful Post: | ||
|
|
2018-12-06, 19:55
|
|
Posts: 1,414 |
Thanked: 7,547 times |
Joined on Aug 2016
@ Estonia
|
#24
|
| The Following 3 Users Say Thank You to rinigus For This Useful Post: | ||
|
|
2018-12-07, 03:19
|
|
Posts: 36 |
Thanked: 118 times |
Joined on Nov 2018
|
#25
|
|
|
2018-12-07, 09:07
|
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#26
|
| The Following 4 Users Say Thank You to ljo For This Useful Post: | ||
|
|
2018-12-11, 11:48
|
|
Posts: 102 |
Thanked: 187 times |
Joined on Jan 2010
|
#27
|
| The Following 3 Users Say Thank You to ljo For This Useful Post: | ||
|
|
2018-12-12, 09:29
|
|
Posts: 36 |
Thanked: 118 times |
Joined on Nov 2018
|
#28
|
I can confirm that there is a hyphenation problem with some words. However, it is not a big problem in normal use, since the issue seems to be linked to compound words. Here are a few examples:
| The Following User Says Thank You to FlyingAntero For This Useful Post: | ||
|
|
2018-12-12, 10:03
|
|
Posts: 1,414 |
Thanked: 7,547 times |
Joined on Aug 2016
@ Estonia
|
#29
|
| The Following 2 Users Say Thank You to rinigus For This Useful Post: | ||
|
|
2018-12-12, 10:39
|
|
Posts: 36 |
Thanked: 118 times |
Joined on Nov 2018
|
#30
|
Profanity is an issue and it would be great to get rid of it. I had the same problem when composing the database for English; a large fraction of the time was spent on that. I would suggest filtering the database and removing all n-grams that include any of the words classified as "bad". For that, we need a list of the words (possibly as substrings). That would have to be provided by native speakers, though. Maybe such a list has already been compiled somewhere...
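For what it's worth, the filtering idea could look roughly like the sketch below. This is only an illustration, not how the Presage database is actually built: the function name `filter_ngrams`, the `(ngram, count)` tuple layout and the `match_substrings` switch are all assumptions made for the example, and the word list is a harmless placeholder that a native speaker would have to replace.

```python
def filter_ngrams(ngrams, bad_words, match_substrings=True):
    """Drop every n-gram that contains a word from bad_words.

    ngrams: iterable of (ngram_text, count) pairs (assumed layout).
    bad_words: list supplied by native speakers (placeholder here).
    match_substrings: if True, a bad word hidden inside a longer token
    (for example inside a compound word) also removes the n-gram.
    """
    bad = [w.lower() for w in bad_words if w]
    kept = []
    for ngram, count in ngrams:
        tokens = ngram.lower().split()
        if match_substrings:
            flagged = any(b in tok for tok in tokens for b in bad)
        else:
            flagged = any(tok in bad for tok in tokens)
        if not flagged:
            kept.append((ngram, count))
    return kept


# Placeholder data; "badword" stands in for a real entry on the list.
sample = [
    ("this is fine", 10),
    ("a badword here", 3),
    ("superbadwords everywhere", 2),
    ("nothing wrong", 5),
]
clean = filter_ngrams(sample, ["badword"])
```

Substring matching is probably the safer default for Finnish, since compound words and inflected forms would slip past an exact match, but it can also throw out innocent words that happen to contain a listed string.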
| The Following User Says Thank You to FlyingAntero For This Useful Post: | ||
| Tags |
| predictive text, presage, text-prediction |
|
I am pretty sure we'd get a Tay.ai type prediction engine out of that corpus!
Dave999: Meateo balloons. What’s so special with em? Is it a ballon?