Posts: 110 | Thanked: 52 times | Joined on Sep 2007
#31
I plugged in a 2GB microsd card, through the same adapter. I then installed the GParted maemo hack from another thread and formatted the 2GB card as ext3. The only oddity was that when I opened the "Removable Memory Card" in the maemo file manager, it gave a "could not open" notification. I eventually figured out that this was due to permissions. Apparently, GParted runs as root, and creates a "lost+found" directory that is owned by root and not writable. So I ran:

Code:
cd /media
sudo gainroot
chown -R user mmc1
chgrp -R users mmc1

And the error went away. Now the 2GB card is working brilliantly with ext3. I'm going to do the same to the 8GB card, and then I can copy over the Wikipedia dump.
 
Posts: 110 | Thanked: 52 times | Joined on Sep 2007
#32
I've completed copying the Wikipedia dump over to the N810's 8GB, class 6 microsd card, and am now ready to do some timing. I'm using "xclock -update 1" to measure the number of seconds that certain operations take (I realize that this won't be very accurate, but hopefully enough to get a reasonable idea).

Loading the dump reader program: 11 seconds
Select the "articles.xml.bz2" file and click "OK": 4 minutes, 5-10 seconds
The main "Wikipedia" page opens.
Type in the search string "edsger dijkstra" and hit "Go": I waited about 6 minutes before I stopped paying close attention; about 4 minutes later the expected page had loaded, for a total of roughly 10 minutes (ouch, though after loading I was able to scroll up and down with the D-pad, and performance was quite snappy).
Clicked a link in the article for "The University of Texas at Austin": After about 8 minutes, the new article opened up in the sidebar to the right, and I could switch to it by clicking on it (only takes about 5 seconds).

Well, it seems that for the full enwiki, Wikipedia Dump Reader is very slow. I wonder why that is. Is the amount of data just too large to expect anything else, or is there room for optimization? (Perhaps switching from a linear to a logarithmic search, or something)

I suppose I'll keep experimenting. There was a script somewhere for converting a Wikipedia dump file to Stardict format. Perhaps that would offer better performance.
 
Posts: 1,208 | Thanked: 1,028 times | Joined on Oct 2007
#33
Originally Posted by nobodysbusiness View Post
Well, it seems that for the full enwiki, Wikipedia Dump Reader is very slow. I wonder why that is. Is the amount of data just too large to expect anything else, or is there room for optimization? (Perhaps switching from a linear to a logarithmic search, or something).
The reason for slowness is that the search "algorithm" is something like
Code:
gzip -cdf  indexfile | grep  searchword
So it's not very efficient: every search decompresses the whole index and scans it linearly with grep.

I have a patched version which uses the Xapian search engine. With Xapian that search would probably be a matter of seconds, but the search index is also very (too) big.
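To give an idea of what the patched lookup does, here is a rough sketch through the Python bindings (names and query are illustrative, not copied from the patch):
Code:
# rough sketch of a Xapian title search; not the exact code from the patch
import xapian

db = xapian.Database("db")                 # directory created by the indexing script
enquire = xapian.Enquire(db)

qp = xapian.QueryParser()
qp.set_database(db)
query = qp.parse_query("edsger dijkstra")  # the search string typed into the reader

enquire.set_query(query)
for match in enquire.get_mset(0, 10):      # first 10 matches
    print match.document.get_data()        # the stored index line for the article
Xapian answers this from its on-disk index instead of decompressing and scanning the whole .idx file, which is where the speedup comes from.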

Last edited by mikkov; 2009-01-23 at 17:32.
 
Posts: 1,208 | Thanked: 1,028 times | Joined on Oct 2007
#34
Here is wikipediadumpreader with xapian support for adventurous people to try.

1. apt-get install wikipediadumpreader
2. apt-get install python2.5-xapian
3. create the index with the unmodified dumpreader
4. gunzip fiwiki-20080407-pages-articles.idx.gz
5. create the xapian index: "python xapian-index.py db < fiwiki-20080611-pages-articles.idx" (see the sketch below for roughly what this step does)
6. replace /usr/share/wikipediadumpreader/dumpReader.py with dumpreader.py from the tar package

That should do it. I haven't tested this for a couple of months, so it might not even work.

directory structure has to be:
/whatever/fiwiki-20080407-pages-articles.xml.bz2
/whatever/db/

filenames are naturally just examples
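
In case someone wants to adapt it, the indexing step is roughly the following (a simplified sketch only; the attached xapian-index.py is the real thing, and the exact .idx line format may differ):
Code:
# simplified sketch of the indexing loop; the attached xapian-index.py is authoritative
# usage: python xapian-index.py db < fiwiki-20080611-pages-articles.idx
import sys
import xapian

db = xapian.WritableDatabase(sys.argv[1], xapian.DB_CREATE_OR_OPEN)

for line in sys.stdin:
    line = line.rstrip("\n")
    # assume each .idx line looks like "<offsets and length> <article title>"
    parts = line.split(" ", 3)
    if len(parts) < 4:
        continue
    title = parts[3]

    doc = xapian.Document()
    doc.set_data(line)                      # keep the whole line so the reader can locate the article in the dump
    for pos, word in enumerate(title.split()):
        doc.add_posting(word.lower(), pos)  # index every word of the title
    db.add_document(doc)

db.flush()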
Attached Files
File Type: tar wikipediadumpreader_xapian.tar (30.0 KB, 86 views)
 

The Following 3 Users Say Thank You to mikkov For This Useful Post:
Posts: 110 | Thanked: 52 times | Joined on Sep 2007
#35
I'll try creating the Xapian index and see how large it is. If the dump file is about 4GB and the index is about 3GB, then I could fit them both on my 8GB microsd card and my problems would be solved.

I've also discovered aarddict, which promises "Sub-second response regardless of dictionary size". So hopefully one of these will work acceptably, even for the huge English Wikipedia.
 
allnameswereout's Avatar
Posts: 3,397 | Thanked: 1,212 times | Joined on Jul 2008 @ Netherlands
#36
xapian-index.py doesn't exist. There is a xapian.py though.

Originally Posted by nobodysbusiness View Post
I plugged in a 2GB microsd card, through the same adapter. I then installed the GParted maemo hack from another thread and formatted the 2GB card as ext3.
You might want to take a look at /etc/fstab to check how the card is set up to be mounted now that it's ext3.
__________________
Goosfraba! All text written by allnameswereout is public domain unless stated otherwise. Thank you for sharing your output!
 
Posts: 110 | Thanked: 52 times | Joined on Sep 2007
#37
I extracted xapian-index.py from the tar file and ran it on the Wikipedia dump that I mentioned earlier. After it had run, I was pleased to find that the DB was only about 1 GB! Unfortunately, I believe that the reason for such a small DB was actually an error:

Code:
Title: 16606 83037 612 Category:Districts of the Quispicanchi Province Words: Category:Districts of the Quispicanchi Province
Adding: Category:Districts 0
Adding: of 1
Adding: the 2
Adding: Quispicanchi 3
Adding: Province 4
Title: 16606 83658 783 Acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy Words: Acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy
Adding: Acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy 0
Exception: Term too long (> 245): acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy
 
Posts: 1,208 | Thanked: 1,028 times | Joined on Oct 2007
#38
Hmm, it appears that the script didn't handle one of the longest words in English properly.

http://en.wikipedia.org/wiki/Acetyls...yliso...serine

You could just remove that line from the articles.idx file or tweak the script to check the word length.
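
The tweak could be as simple as this (a sketch, assuming the indexing loop adds one posting per word of the title; Xapian refuses terms longer than 245 bytes, as the exception says):
Code:
# skip words that Xapian cannot store as terms (the 245-byte limit from the exception)
MAX_TERM_BYTES = 245

for pos, word in enumerate(title.split()):
    term = word.lower()
    if len(term) > MAX_TERM_BYTES:
        continue                   # e.g. that protein name
    doc.add_posting(term, pos)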
 

The Following 2 Users Say Thank You to mikkov For This Useful Post:
qole's Avatar
Moderator | Posts: 7,109 | Thanked: 8,820 times | Joined on Oct 2007 @ Vancouver, BC, Canada
#39
So the entire 1185-letter word is included in the index?

I contend that it isn't an English word at all, just a chemical formula written in a way that looks like a word.

It does hold the record for the longest word published in an English language publication in a serious context — that is, for some reason other than to publish a very long word...
I will accept that. The longest "real" word in English is "Antidisestablishmentarianism", but there really are no opportunities to use it anymore now that church and state are disestablished...
__________________
qole.org --- twitter --- Easy Debian wiki page
Please don't send me a private message, post to the appropriate thread.
Thank you all for your donations!
 
Posts: 6 | Thanked: 9 times | Joined on Oct 2008
#40
The Aard Dictionary has Wikipedia in aarddict format available via BitTorrent. The torrent link is at http://aarddict.org/ under "Wikipedia" on the left.
 

The Following 3 Users Say Thank You to Entonian For This Useful Post: