nlp - Store a large text corpus in Python
I am trying to build a large text corpus from a Wikipedia dump. I represent every article as a document object consisting of:
- the original text: a string
- the preprocessed text: a list of tuples, where each tuple contains a (stemmed) word and the position of that word in the original text
- some additional information, such as the title and author
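For concreteness, here is a minimal sketch of what such a document object might look like (the class and field names are just placeholders):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: int        # unique identifier used for lookup
    title: str         # metadata: article title
    author: str        # metadata: article author
    text: str          # the original article text
    # (stemmed_word, position_in_original_text) pairs
    tokens: list[tuple[str, int]] = field(default_factory=list)
```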
I am searching for an efficient way to save these objects to disk. The following operations should be possible:
- adding a new document
- accessing documents via an ID
- iterating over the documents
It is not necessary to remove an object once it has been added.
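To illustrate the access pattern I need (not a solution I am committed to), Python's built-in `shelve` module already supports all three operations on top of pickle; a sketch, assuming the `Document` class above, with my open question being whether something like this scales to a full Wikipedia dump:

```python
import shelve

# shelve provides a persistent, dict-like store; keys must be strings
with shelve.open("corpus.db") as db:
    doc = Document(doc_id=1, title="Python", author="unknown",
                   text="Python is a language.",
                   tokens=[("python", 0), ("languag", 12)])

    db[str(doc.doc_id)] = doc      # adding a new document (pickled)

    same_doc = db["1"]             # accessing a document via its ID

    for key in db:                 # iterating over all documents
        print(db[key].title)
```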
I can imagine the following methods:
- Serializing each article to a separate file, for example using pickle (see the sketch after this list). The downside here is the large number of operating system calls.
- Storing all documents in a single XML file, or blocks of documents in several XML files. Storing the list that represents a preprocessed document in XML format incurs a lot of overhead, and I think reading the list back from XML would be quite slow.
- Using an existing package for storing the corpus. I found the corpora package, which seems fast and efficient and supports storing strings plus a header including metadata, but putting the preprocessed text into the header makes it run incredibly slow.
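Here is roughly what the pickle-per-file approach would look like (the directory and naming scheme are placeholders). Every add or lookup costs at least one open/write-or-read/close cycle, and iteration touches every file on disk, which is the syscall overhead I am worried about:

```python
import pickle
from pathlib import Path

CORPUS_DIR = Path("corpus")    # one pickle file per document (placeholder path)
CORPUS_DIR.mkdir(exist_ok=True)

def save_document(doc: Document) -> None:
    # each call opens, writes, and closes a separate file
    with open(CORPUS_DIR / f"{doc.doc_id}.pkl", "wb") as f:
        pickle.dump(doc, f)

def load_document(doc_id: int) -> Document:
    with open(CORPUS_DIR / f"{doc_id}.pkl", "rb") as f:
        return pickle.load(f)

def iter_documents():
    # iterating means opening every file in the directory
    for path in CORPUS_DIR.glob("*.pkl"):
        with open(path, "rb") as f:
            yield pickle.load(f)
```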
What is the best way to do this? Maybe there is a package for this purpose that I have not found yet?