    Installing Encyclopedia???

    Page 4 of 8
    nobodysbusiness | # 31 | 2009-01-23, 03:42

    I plugged in a 2GB microsd card, through the same adapter. I then installed the GParted maemo hack from another thread and formatted the 2GB card as ext3. The only oddity was that when I opened the "Removable Memory Card" in the maemo file manager, it gave a "could not open" notification. I eventually figured out that this was due to permissions. Apparently, GParted runs as root, and creates a "lost+found" directory that is owned by root and not writable. So I ran:

    cd /media
    sudo gainroot          # drop to a root shell (needs rootsh or R&D mode enabled)
    chown -R user mmc1     # hand the card back to the default "user" account
    chgrp -R users mmc1    # and to the default "users" group

    After that, the error went away. Now the 2GB card is working brilliantly with ext3. I'm going to do the same to the 8GB card, and then I can copy over the Wikipedia dump.


     
    nobodysbusiness | # 32 | 2009-01-23, 16:46

    I've finished copying the Wikipedia dump over to the N810's 8GB, class 6 microSD card, and am now ready to do some timing. I'm using "xclock -update 1" to measure the number of seconds that certain operations take (I realize this won't be very accurate, but hopefully it's enough to get a reasonable idea).

    Loading the dump reader program: 11 seconds.
    Selecting the "articles.xml.bz2" file and clicking "OK": 4 minutes, 5-10 seconds, after which the main "Wikipedia" page opens.
    Typing the search string "edsger dijkstra" and hitting "Go": I waited about 6 minutes before I stopped paying close attention; roughly 4 minutes after that the expected page had loaded, for a total of about 10 minutes (ouch). Once loaded, though, I could scroll up and down with the D-pad, and performance was quite snappy.
    Clicking the link in the article for "The University of Texas at Austin": after about 8 minutes, the new article opened in the sidebar on the right, and I could switch to it by clicking on it (which only takes about 5 seconds).

    Well, it seems that for the full enwiki, Wikipedia Dump Reader is very slow. I wonder why that is. Is the amount of data just too large to expect anything else, or is there room for optimization? (Perhaps switching from a linear search to a logarithmic one, or something.)

    I suppose I'll keep experimenting. There was a script somewhere for converting a Wikipedia dump file to Stardict format. Perhaps that would offer better performance.


     
    mikkov | # 33 | 2009-01-23, 17:27

    Originally Posted by nobodysbusiness View Post
    Well, it seems that for the full enwiki, Wikipedia Dump Reader is very slow. I wonder why that is. Is the amount of data just too large to expect anything else, or is there room for optimization? (Perhaps switching from a linear search to a logarithmic one, or something.)
    The reason for the slowness is that the search "algorithm" is something like:
    Code:
    gzip -cdf indexfile | grep searchword
    So it's not very efficient.

    I have a patched version which uses the Xapian search engine. With Xapian, that search would probably be a matter of seconds, but the search index is also very (too) big.
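
    For comparison, a Xapian lookup is something like this minimal Python 2.5 sketch (the "db" directory name and the AND-of-title-words query are just illustrative guesses, not necessarily what the patched reader does):
    Code:
    # Minimal sketch of querying a Xapian index (Python 2.5 bindings).
    # Assumes the index stores lowercased title words as terms.
    import xapian

    db = xapian.Database("db")              # the index directory
    enquire = xapian.Enquire(db)
    enquire.set_query(xapian.Query(xapian.Query.OP_AND,
                                   ["edsger", "dijkstra"]))

    for match in enquire.get_mset(0, 10):   # top 10 matches
        print match.rank, match.document.get_data()
    Because Xapian keeps a sorted term index on disk, a lookup like this reads only a few blocks instead of scanning the whole index file the way the gzip/grep pipeline does.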



     
    mikkov | # 34 | 2009-01-23, 18:03

    Here is wikipediadumpreader with Xapian support for adventurous people to try.

    1. apt-get install wikipediadumpreader
    2. apt-get install python2.5-xapian
    3. Create the index with the unmodified dumpreader.
    4. gunzip fiwiki-20080407-pages-articles.idx.gz
    5. Create the xapian index: "python xapian-index.py db < fiwiki-20080611-pages-articles.idx" (a sketch of roughly what this step does is below).
    6. Replace /usr/share/wikipediadumpreader/dumpReader.py with dumpreader.py from the tar package.

    That should do it. I haven't tested this for a couple of months, so it might not even work.

    The directory structure has to be:
    /whatever/fiwiki-20080407-pages-articles.xml.bz2
    /whatever/db/

    The filenames are naturally just examples.
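
    For the curious, the indexing step is roughly this (the real xapian-index.py is in the attached tar; the "three leading number fields, then the title" line format is a guess based on the log output in post #37 below):
    Code:
    # Hypothetical sketch of an indexer like xapian-index.py (Python 2.5).
    # Assumes each .idx line looks like "<num> <num> <num> <article title>".
    import sys
    import xapian

    db = xapian.WritableDatabase(sys.argv[1], xapian.DB_CREATE_OR_OPEN)

    for line in sys.stdin:
        line = line.rstrip("\n")
        doc = xapian.Document()
        doc.set_data(line)                  # keep the offsets so the reader can seek
        for pos, word in enumerate(line.split()[3:]):
            doc.add_posting(word.lower(), pos)
        db.add_document(doc)

    db.flush()                              # commit the index to disk
    It would be run exactly as in step 5: python xapian-index.py db < fiwiki-20080611-pages-articles.idx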

    Attached Files
    File Type: tar wikipediadumpreader_xapian.tar (30.0 KB, 92 views)

     
    nobodysbusiness | # 35 | 2009-01-23, 21:19

    I'll try creating the Xapian index and see how large it is. If the dump file is about 4GB and the index is about 3GB, then I could fit them both on my 8GB microsd card and my problems would be solved.

    I've also discovered aarddict, which promises "Sub-second response regardless of dictionary size". So hopefully one of these will work acceptably, even for the huge English Wikipedia.


     
    allnameswereout | # 36 | 2009-01-23, 21:43

    xapian-index.py doesn't exist. There is a xapian.py though.

    Originally Posted by nobodysbusiness View Post
    I plugged in a 2GB microsd card, through the same adapter. I then installed the GParted maemo hack from another thread and formatted the 2GB card as ext3.
    You might want to take a look at /etc/fstab.
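
    For example, an /etc/fstab entry along these lines would mount the card as ext3 with sensible options (the device node /dev/mmcblk0p1 is only a guess; check the output of mount to see which node your removable slot uses):
    Code:
    # Hypothetical /etc/fstab entry for a removable card formatted as ext3.
    /dev/mmcblk0p1  /media/mmc1  ext3  noatime,errors=continue  0  0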


     
    nobodysbusiness | # 37 | 2009-01-25, 01:35

    I extracted xapian-index.py from the tar file and ran it on the Wikipedia dump that I mentioned earlier. After it had run, I was pleased to find that the DB was only about 1 GB! Unfortunately, I believe that the reason for such a small DB was actually an error:

    Code:
    Title: 16606 83037 612 Category:Districts of the Quispicanchi Province Words: Category:Districts of the Quispicanchi Province
    Adding: Category:Districts 0
    Adding: of 1
    Adding: the 2
    Adding: Quispicanchi 3
    Adding: Province 4
    Title: 16606 83658 783 Acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy Words: Acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy
    Adding: Acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy 0
    Exception: Term too long (> 245): acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucylasparaginylvalylcysteinylthreonylserylserylleucylglycylasparaginylglutaminy


     
    mikkov | # 38 | 2009-01-25, 21:54

    Hmm, it appears that the script didn't handle one of the longest words in English properly:

    http://en.wikipedia.org/wiki/Acetyls...yliso...serine

    You could just remove that line from the articles.idx file, or tweak the script to check the word length (a sketch of such a check is below).
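
    A minimal sketch of that tweak, assuming the indexing loop adds one posting per title word (the MAX_TERM_BYTES constant and the helper name are illustrative, not from the actual script):
    Code:
    # Hypothetical length check for the indexing loop (Python 2.5).
    # Xapian rejects terms longer than about 245 bytes, so skip such words.
    MAX_TERM_BYTES = 240

    def add_title_postings(doc, title):
        """Add each title word to a xapian.Document, skipping oversized ones."""
        for pos, word in enumerate(title.split()):
            if len(word) > MAX_TERM_BYTES:
                continue      # e.g. the 1185-letter protein name above
            doc.add_posting(word.lower(), pos)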


     
    qole | # 39 | 2009-01-26, 18:14

    So the entire 1185-letter word is included in the index?

    I contend that it isn't an English word at all, just a chemical formula written in a way that looks like a word.

    Originally Posted by
    It does hold the record for the longest word published in an English language publication in a serious context — that is, for some reason other than to publish a very long word...
    I will accept that. The longest "real" word in English is "Antidisestablishmentarianism", but there really are no opportunities to use it anymore now that church and state are disestablished...


     
    Entonian | # 40 | 2009-02-05, 06:55

    The Aard Dictionary project has Wikipedia available in aarddict format via BitTorrent. The torrent link is at http://aarddict.org/, under "Wikipedia" on the left.


     