nlp - Store a large text corpus in Python
I am trying to build a large text corpus from a Wikipedia dump. I represent every article as a document object consisting of:
- the original text: a string
- the preprocessed text: a list of tuples, where each tuple contains a (stemmed) word and the position of that word in the original text
- some additional information, such as the title and author
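For concreteness, here is a minimal sketch of what such a document object might look like (the class and field names are just placeholders):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: int        # unique identifier used for lookup
    title: str         # metadata: article title
    author: str        # metadata: article author
    text: str          # the original article text
    # (stemmed_word, position_in_original_text) pairs
    tokens: list[tuple[str, int]] = field(default_factory=list)
```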
I am searching for an efficient way to save these objects to disk. The following operations should be possible:
- adding a new document
- accessing documents via an ID
- iterating over the documents
It is not necessary to remove an object once it has been added.
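To illustrate the access pattern I need (not a solution I am committed to), Python's built-in `shelve` module already supports all three operations on top of pickle; a sketch, assuming the `Document` class above, with my open question being whether something like this scales to a full Wikipedia dump:

```python
import shelve

# shelve provides a persistent, dict-like store; keys must be strings
with shelve.open("corpus.db") as db:
    doc = Document(doc_id=1, title="Python", author="unknown",
                   text="Python is a language.",
                   tokens=[("python", 0), ("languag", 12)])

    db[str(doc.doc_id)] = doc      # adding a new document (pickled)

    same_doc = db["1"]             # accessing a document via its ID

    for key in db:                 # iterating over all documents
        print(db[key].title)
```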
I can imagine the following methods:
- Serializing each article to a separate file, for example using pickle (see the sketch after this list). The downside here is the large number of operating system calls.
- Storing all documents in a single XML file, or blocks of documents in several XML files. Storing the list that represents a preprocessed document in XML format incurs a lot of overhead, and I think reading the list back from XML would be quite slow.
- Using an existing package for storing the corpus. I found the corpora package, which seems fast and efficient and supports storing strings plus a header including metadata, but putting the preprocessed text into the header makes it run incredibly slow.
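Here is roughly what the pickle-per-file approach would look like (the directory and naming scheme are placeholders). Every add or lookup costs at least one open/write-or-read/close cycle, and iteration touches every file on disk, which is the syscall overhead I am worried about:

```python
import pickle
from pathlib import Path

CORPUS_DIR = Path("corpus")    # one pickle file per document (placeholder path)
CORPUS_DIR.mkdir(exist_ok=True)

def save_document(doc: Document) -> None:
    # each call opens, writes, and closes a separate file
    with open(CORPUS_DIR / f"{doc.doc_id}.pkl", "wb") as f:
        pickle.dump(doc, f)

def load_document(doc_id: int) -> Document:
    with open(CORPUS_DIR / f"{doc_id}.pkl", "rb") as f:
        return pickle.load(f)

def iter_documents():
    # iterating means opening every file in the directory
    for path in CORPUS_DIR.glob("*.pkl"):
        with open(path, "rb") as f:
            yield pickle.load(f)
```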
What is the best way to do this? Maybe there is a package for this purpose that I have not found yet?