Benson
Posts: 4,930 | Thanked: 2,272 times | Joined on Oct 2007
#21
If it's 4.1 GB bz2ed, you'll need to delete most of it. I'm not sure it would fit uncompressed on a 16 GB SD (but I'm not going to download 4 GB to check!).

It would be nice, BTW, for anyone with any of these multi-GB bundles to post back with the uncompressed size, so others know before they download. (I find it odd that that info is not being posted with the downloads; you'd think it would be a key number for actual use... but it's omitted on both download.wikimedia.org and www.soschildrensvillages.org)
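
For what it's worth, the uncompressed size can be measured without writing the XML anywhere by streaming the bz2 through a byte counter. This is only a suggestion in Python (nobody in the thread posted it), and the file name is just the enwiki dump mentioned above:

Code:
import bz2

def uncompressed_size(path, chunk=1 << 20):
    """Stream-decompress a .bz2 and count the output bytes without storing them."""
    total = 0
    with bz2.BZ2File(path) as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            total += len(block)
    return total

# e.g. print(uncompressed_size("enwiki-latest-pages-articles.xml.bz2"))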
 
allnameswereout
Posts: 3,397 | Thanked: 1,212 times | Joined on Jul 2008 @ Netherlands
#22
Originally Posted by nobodysbusiness
I have visited http://download.wikimedia.org/enwiki/latest/ and am currently downloading enwiki-latest-pages-articles.xml.bz2. It's 4.1 GB, so hopefully I can open the bz2 file and delete a few sections that I'm not interested in before copying it onto my 4 GB SD card.
I just use Wikipedia Offline Reader to read the .xml.bz2. When you run it the first time, it creates an index (*.idx.gz and *.blocks.idx), which takes a while. I haven't looked at the code, but if it was written smartly it takes advantage of bzip2's block structure, which lets specific chunks of the file be decompressed without unpacking the whole thing.
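
A minimal sketch of that general idea in Python (this is not Wikipedia Offline Reader's actual code; the file names and the crude tag matching are assumptions): stream through the dump once, record where each <page> element starts in the decompressed stream, save that as an index, then seek back to the recorded offset to pull out a single article. A smarter reader would index bzip2 block boundaries instead, so a lookup would not have to re-decompress everything before the offset.

Code:
import bz2
import gzip
import pickle

DUMP = "enwiki-latest-pages-articles.xml.bz2"     # assumed file names
INDEX = "enwiki-latest-pages-articles.idx.gz"

def build_index():
    """Map article title -> byte offset of its <page> element in the decompressed stream."""
    index = {}
    offset = 0
    page_start = 0
    with bz2.BZ2File(DUMP) as dump:
        for line in dump:
            if b"<page>" in line:
                page_start = offset
            elif b"<title>" in line:
                title = line.strip()[7:-8].decode("utf-8")
                index[title] = page_start
            offset += len(line)
    with gzip.open(INDEX, "wb") as out:
        pickle.dump(index, out)

def fetch(title):
    """Decompress up to the stored offset and return one <page>...</page> chunk."""
    with gzip.open(INDEX, "rb") as f:
        index = pickle.load(f)
    with bz2.BZ2File(DUMP) as dump:
        dump.seek(index[title])    # seek() re-decompresses up to this offset, hence the wait
        page = []
        for line in dump:
            page.append(line)
            if b"</page>" in line:
                break
    return b"".join(page).decode("utf-8")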

Although it isn't fast on a 350 MB .xml.bz2 (a lookup takes a guesstimated 15 seconds), it does the job while being offline. It might not be practical for the English Wikipedia, since that one is more than 10x as big, but I don't expect 1000% overhead either.

I'm not using StarDict here; I just used the image directly from download.wikimedia.org. StarDict is probably way faster because it converts the XML to SQLite, but you'll have to convert the image first on a normal, fast computer.

A newer version of this encyclopedia is 402 MB as *.xml.bz2. It extracts to a 1.95 GB *.xml. That is more than 4 times as much, and it's all UTF-8/Unicode text.
__________________
Goosfraba! All text written by allnameswereout is public domain unless stated otherwise. Thank you for sharing your output!
 

Benson
Posts: 4,930 | Thanked: 2,272 times | Joined on Oct 2007
#23
Originally Posted by allnameswereout
I just use Wikipedia Offline Reader to read the .xml.bz2. When you run it the first time, it creates an index (*.idx.gz and *.blocks.idx), which takes a while. I haven't looked at the code, but if it was written smartly it takes advantage of bzip2's block structure, which lets specific chunks of the file be decompressed without unpacking the whole thing.
Cool. Slowness is not nice, but fitting things into a reasonable size is. I'd have thought the performance hit would be punitive.... 15 seconds... Oh, it is!

I have to wonder, though, what the size penalty would be to bz2 each article individually and tar up the whole wad (I realize it's not an indexed tarball currently, but it could be); that should give markedly faster decompression at the cost of somewhat worse compression.
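
Purely hypothetically, such a repacking could look like the sketch below (made-up file layout; nobody in the thread has built this). Each article becomes its own bz2 member inside a plain tar, so a lookup only has to decompress a few kilobytes; the flip side is that per-article compression loses the cross-article redundancy whole-file bzip2 exploits, which is the size penalty in question, and a plain tar has no index, so finding a member still means scanning the archive headers.

Code:
import bz2
import io
import tarfile

def pack(articles, archive="enwiki-per-article.tar"):
    """articles: iterable of (title, wikitext) pairs; writes one .bz2 member per article."""
    with tarfile.open(archive, "w") as tar:       # plain tar; each member is already bz2
        for title, text in articles:
            blob = bz2.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=title.replace("/", "_") + ".bz2")
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))

def read_article(archive, title):
    """Pull one member back out and decompress just that article."""
    with tarfile.open(archive, "r") as tar:
        member = tar.extractfile(title.replace("/", "_") + ".bz2")
        return bz2.decompress(member.read()).decode("utf-8")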
Originally Posted by allnameswereout
A newer version of this encyclopedia is 402 MB as *.xml.bz2. It extracts to a 1.95 GB *.xml. That is more than 4 times as much, and it's all UTF-8/Unicode text.
Yeah; scaling by that ratio, the 4.1 GB English dump would uncompress to roughly 20 GB, so I guess pick up that 32 GB card if you want to decompress the English one. N800 ftw!
 
mikkov
Posts: 1,208 | Thanked: 1,028 times | Joined on Oct 2007
#24
Originally Posted by allnameswereout
Although it isn't fast on a 350 MB .xml.bz2 (a lookup takes a guesstimated 15 seconds), it does the job while being offline. It might not be practical for the English Wikipedia, since that one is more than 10x as big, but I don't expect 1000% overhead either.
I have actually tested the Xapian search engine with wikipediadumpreader and it is _much_ faster, but the index file is almost as big as the bz2 file.
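
For anyone curious what that looks like, below is a rough sketch of Xapian indexing and search with the Python bindings. It is not wikipediadumpreader's actual integration; the database path and the way articles are fed in are assumptions. Storing postings for every word in every article is also a plausible reason the index ends up nearly as large as the compressed dump.

Code:
import xapian

def index_articles(articles, dbpath="enwiki.xapian"):
    """articles: iterable of (title, wikitext) pairs."""
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))
    for title, text in articles:
        doc = xapian.Document()
        doc.set_data(title)                  # store the title so hits can be displayed
        tg.set_document(doc)
        tg.index_text(title, 5)              # weight title terms more heavily
        tg.index_text(text)
        db.add_document(doc)
    db.commit()

def search(query_string, dbpath="enwiki.xapian", limit=10):
    db = xapian.Database(dbpath)
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    qp.set_database(db)
    enquire = xapian.Enquire(db)
    enquire.set_query(qp.parse_query(query_string))
    return [m.document.get_data().decode("utf-8") for m in enquire.get_mset(0, limit)]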
 

nobodysbusiness
Posts: 110 | Thanked: 52 times | Joined on Sep 2007
#25
I installed the Wikipedia dump reader on my Linux box and opened the dump file there to create the index quickly, as suggested earlier in the thread. Two new files were created: enwiki-latest-pages-articles.blocks.idx (385 KB) and enwiki-latest-pages-articles.idx.gz (125 MB). Does this mean that I can copy these three files to an 8 GB microSD card (with adapter) and be able to read Wikipedia offline on the N810? Has anyone else tried it with a dump this large? How slow is it to look up a page? (BTW, on my big laptop, creating the index files only took about 2 hours for all 4.1 GB. Based on someone else's comment above, it seems that creating an index for this file on the N810 would take about 4 days. Does this seem about right?)
 
allnameswereout
Posts: 3,397 | Thanked: 1,212 times | Joined on Jul 2008 @ Netherlands
#26
Originally Posted by nobodysbusiness
I installed the Wikipedia dump reader on my Linux box and opened the dump file there to create the index quickly, as suggested earlier in the thread. Two new files were created: enwiki-latest-pages-articles.blocks.idx (385 KB) and enwiki-latest-pages-articles.idx.gz (125 MB). Does this mean that I can copy these three files to an 8 GB microSD card (with adapter) and be able to read Wikipedia offline on the N810? Has anyone else tried it with a dump this large? How slow is it to look up a page? (BTW, on my big laptop, creating the index files only took about 2 hours for all 4.1 GB. Based on someone else's comment above, it seems that creating an index for this file on the N810 would take about 4 days. Does this seem about right?)
Good point on creating the database on a dedicated computer.

Yes, you must move all 3 files.

But for this database you must use ext3, because FAT32 (vfat) has a maximum file size of 4 GiB while your database is 4.1 GB.

If you can grab an older version that is smaller than 4 GiB, you're set on FAT32, with the disadvantage that your database won't be current.

One can also use the Simple English Wikipedia, which is much smaller.
 

nobodysbusiness
Posts: 110 | Thanked: 52 times | Joined on Sep 2007
#27
I submitted an order for a new Class 6 8GB microSD card. Once it arrives, I'll copy the files over and let you all know how long it takes to search and retrieve an article in the full enwiki.
 
allnameswereout
Posts: 3,397 | Thanked: 1,212 times | Joined on Jul 2008 @ Netherlands
#28
Good. FWIW, I didn't use SD; I used the N810's internal 2 GB storage, FAT32-formatted. But I can test on a Class 6 8 GB microSD if there is demand.
 
qole
Moderator | Posts: 7,109 | Thanked: 8,820 times | Joined on Oct 2007 @ Vancouver, BC, Canada
#29
Originally Posted by Benson
It would be nice, BTW, for anyone with any of these multi-GB bundles to post back with the uncompressed size, so others know before they download. (I find it odd that that info is not being posted with the downloads; you'd think it would be a key number for actual use... but it's omitted on both download.wikimedia.org and www.soschildrensvillages.org)
The SOS Children's Villages site suggests that their sanitized, expurgated version is around 4.5 GB uncompressed:

It has about 5500 articles (as much as can be fitted on a DVD with good size images) and is about the size of a twenty volume encyclopaedia (34,000 images and 20 million words).
__________________
qole.org --- twitter --- Easy Debian wiki page
Please don't send me a private message, post to the appropriate thread.
Thank you all for your donations!
 

nobodysbusiness
Posts: 110 | Thanked: 52 times | Joined on Sep 2007
#30
I have received the new 8 GB microSD card, and the stock FAT32 partition is recognized by Maemo when I plug it in (through the adapter). That's quite a relief, as some people seem to have had difficulty with adapters and microSD in the past. Anyway, I'm having some trouble figuring out how to partition the new card with ext3 so that it can take large files. Based on reading various other threads, it seems that the easiest way to go about this would be to install gparted for Maemo and then do the format graphically. However, I'm a little uncertain about taking this step. Do I have to do anything other than format the card with gparted? Will Maemo recognize the ext3-formatted card as soon as I plug it in, like it does with FAT32?
 