nlp - Store large text corpus in Python


I am trying to build a large text corpus from a Wikipedia dump. I represent every article as a document object consisting of:

  • the original text: a string
  • the preprocessed text: a list of tuples, where each tuple contains a (stemmed) word and the position of that word in the original text
  • some additional information such as title and author
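As a minimal sketch of the document object described above, a dataclass could look like the following. The class name `Document`, the field names, and the reading of "position" as a character offset are all assumptions made for illustration; the question does not specify them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Document:
    """One Wikipedia article (class and field names are illustrative)."""
    text: str  # the original article text
    # (stemmed word, character offset of the word in `text`); the offset
    # interpretation of "position" is an assumption
    tokens: List[Tuple[str, int]] = field(default_factory=list)
    title: str = ""
    author: str = ""


doc = Document(
    text="Cats are running",
    tokens=[("cat", 0), ("are", 5), ("run", 9)],
    title="Cat",
    author="unknown",
)
```

Keeping the offsets next to the stems lets each token be mapped back to its place in the original string, for example `doc.text[9:12]` for the last token.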

I am searching for an efficient way to save these objects to disk. The following operations should be possible:

  • adding a new document
  • accessing documents via an ID
  • iterating over all documents

It is not necessary to remove an object once it has been added.
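To make the three required operations concrete, here is one possible sketch that keeps all documents in a single SQLite file through the standard-library `sqlite3` module. SQLite is not mentioned in the question; it is swapped in here only as an example backend, and the class name `CorpusStore`, the table layout, and the JSON encoding of the token list are hypothetical choices.

```python
import json
import sqlite3


class CorpusStore:
    """Hypothetical append-only document store backed by one SQLite file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS docs ("
            "id INTEGER PRIMARY KEY, title TEXT, author TEXT, "
            "text TEXT, tokens TEXT)"
        )

    def add(self, title, author, text, tokens):
        """Append a document and return its new id."""
        cur = self.conn.execute(
            "INSERT INTO docs (title, author, text, tokens) VALUES (?, ?, ?, ?)",
            (title, author, text, json.dumps(tokens)),
        )
        self.conn.commit()
        return cur.lastrowid

    def get(self, doc_id):
        """Return one document as a dict, or None if the id is unknown."""
        row = self.conn.execute(
            "SELECT id, title, author, text, tokens FROM docs WHERE id = ?",
            (doc_id,),
        ).fetchone()
        return self._to_dict(row) if row else None

    def __iter__(self):
        """Yield all documents in id order."""
        cursor = self.conn.execute(
            "SELECT id, title, author, text, tokens FROM docs ORDER BY id"
        )
        for row in cursor:
            yield self._to_dict(row)

    @staticmethod
    def _to_dict(row):
        doc_id, title, author, text, tokens = row
        # JSON stores tuples as lists, so convert them back on load.
        return {
            "id": doc_id, "title": title, "author": author, "text": text,
            "tokens": [tuple(t) for t in json.loads(tokens)],
        }


store = CorpusStore(":memory:")  # pass a file path for a persistent corpus
first_id = store.add("Cat", "unknown", "Cats are running",
                     [("cat", 0), ("run", 9)])
```

With this sketch, `store.get(first_id)` looks a document up by ID and `for doc in store:` iterates over the corpus. Committing after every insert is simple but slow for a full dump; batching many inserts per commit would likely be needed in practice.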

I could imagine the following methods:

  • serializing each article to a separate file, for example using pickle: the downside here is the large number of operating system calls
  • storing the documents in a single XML file, or blocks of documents in several XML files: storing the list that represents the preprocessed document in XML format adds a lot of overhead, and I think it is quite slow to read a list back from XML
  • using an existing package for storing a corpus: I found the corpora package, which seems to be fast and efficient, but it only supports storing strings plus a header including metadata. Simply putting the preprocessed text in the header makes it run incredibly slowly.
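The first of these methods, one pickle file per article, might be sketched as follows. The function names, the `<id>.pkl` file naming scheme, and the dict-shaped documents are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path


def save_document(corpus_dir, doc_id, doc):
    """Write one document to its own file, named after its id."""
    with open(Path(corpus_dir) / f"{doc_id}.pkl", "wb") as f:
        pickle.dump(doc, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_document(corpus_dir, doc_id):
    """Read a single document back by id."""
    with open(Path(corpus_dir) / f"{doc_id}.pkl", "rb") as f:
        return pickle.load(f)


def iter_documents(corpus_dir):
    """Yield (id, document) pairs; every step opens and closes one file."""
    paths = sorted(Path(corpus_dir).glob("*.pkl"), key=lambda p: int(p.stem))
    for path in paths:
        with open(path, "rb") as f:
            yield int(path.stem), pickle.load(f)


corpus_dir = tempfile.mkdtemp()  # throwaway directory for the demo
save_document(corpus_dir, 1, {"title": "Cat", "text": "Cats are running",
                              "tokens": [("cat", 0), ("run", 9)]})
save_document(corpus_dir, 2, {"title": "Dog", "text": "Dogs bark",
                              "tokens": [("dog", 0), ("bark", 5)]})
```

Each save, load, and iteration step costs at least one open and one close call, which is the overhead the first bullet refers to. Pickle files should also only be loaded from sources you trust, since unpickling can execute arbitrary code.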

What is a good way to do this? Maybe there is a package for this purpose that I have not found yet?

