Transposed parameter in the Matrix Market format of gensim (Python)


In the gensim library, there is an MmReader class that converts a Matrix Market format file into a Python object. Sometimes it is necessary to transpose the matrix, hence the transposed parameter was introduced in MmReader.

However, I am confused why, at lines 525-526 and 567-568 of https://github.com/piskvorky/gensim/blob/develop/gensim/matutils.py, the swap of the term and document values and IDs happens when transposed == False.

Could anyone familiar with term-document matrices in information retrieval enlighten me?

    class MmReader(object):
        """
        Wrap a term-document matrix on disk (in matrix-market format), and present it
        as an object which supports iteration over the rows (~documents).

        Note that the file is read into memory one document at a time, not the whole
        matrix at once (unlike scipy.io.mmread). This allows us to process corpora
        which are larger than the available RAM.
        """
        def __init__(self, input, transposed=True):
            """
            Initialize the matrix reader.

            The `input` refers to a file on local filesystem, which is expected to
            be in the sparse (coordinate) Matrix Market format. Documents are assumed
            to be rows of the matrix (and document features are columns).

            `input` is either a string (file path) or a file-like object that supports
            `seek()` (e.g. gzip.GzipFile, bz2.BZ2File).
            """
            logger.info("initializing corpus reader from %s" % input)
            self.input, self.transposed = input, transposed
            if isinstance(input, basestring):
                input = open(input)
            header = input.next().strip()
            if not header.lower().startswith('%%matrixmarket matrix coordinate real general'):
                raise ValueError("File %s not in Matrix Market format with coordinate real general; instead found: \n%s" %
                                 (self.input, header))
            self.num_docs = self.num_terms = self.num_nnz = 0
            for lineno, line in enumerate(input):
                if not line.startswith('%'):
                    self.num_docs, self.num_terms, self.num_nnz = map(int, line.split())
                    if not self.transposed: ## line 525
                        self.num_docs, self.num_terms = self.num_terms, self.num_docs
                    break
            logger.info("accepted corpus with %i documents, %i features, %i non-zero entries" %
                        (self.num_docs, self.num_terms, self.num_nnz))

        def __len__(self):
            return self.num_docs

        def __str__(self):
            return ("MmCorpus(%i documents, %i features, %i non-zero entries)" %
                    (self.num_docs, self.num_terms, self.num_nnz))

        def skip_headers(self, input_file):
            """
            Skip file headers that appear before the first document.
            """
            for line in input_file:
                if line.startswith('%'):
                    continue
                break

        def __iter__(self):
            """
            Iteratively yield vectors from the underlying file, in the format (row_no, vector),
            where vector is a list of (col_no, value) 2-tuples.

            Note that the total number of vectors returned is always equal to the
            number of rows specified in the header; empty documents are inserted and
            yielded where appropriate, even if they are not explicitly stored in the
            Matrix Market file.
            """
            if isinstance(self.input, basestring):
                fin = open(self.input)
            else:
                fin = self.input
                fin.seek(0)
            self.skip_headers(fin)

            previd = -1
            for line in fin:
                docid, termid, val = line.split()
                if not self.transposed:
                    termid, docid = docid, termid
                # -1 because matrix market indexes are 1-based => convert to 0-based
                docid, termid, val = int(docid) - 1, int(termid) - 1, float(val)
                assert previd <= docid, "matrix columns must come in ascending order"
                if docid != previd:
                    # change of document: return the document read so far (its id is previd)
                    if previd >= 0:
                        yield previd, document

                    # return implicit (empty) documents between the previous id and the new id
                    # too, to keep consistent document numbering and corpus length
                    for previd in xrange(previd + 1, docid):
                        yield previd, []

                    # from now on start adding fields to a new document, with a new id
                    previd = docid
                    document = []

                document.append((termid, val,)) # add another field to the current document

            # handle the last document, as a special case
            if previd >= 0:
                yield previd, document

            # return empty documents between the last explicit document and the number
            # of documents as specified in the header
            for previd in xrange(previd + 1, self.num_docs):
                yield previd, []

        def docbyoffset(self, offset):
            """Return the document at file offset `offset` (in bytes)."""
            # empty documents are not stored explicitly in MM format, so the index marks
            # them with a special offset, -1.
            if offset == -1:
                return []
            if isinstance(self.input, basestring):
                fin = open(self.input)
            else:
                fin = self.input

            fin.seek(offset) # works for gzip/bz2 input, too
            previd, document = -1, []
            for line in fin:
                docid, termid, val = line.split()
                if not self.transposed: ## line 567
                    termid, docid = docid, termid
                # -1 because matrix market indexes are 1-based => convert to 0-based
                docid, termid, val = int(docid) - 1, int(termid) - 1, float(val)
                assert previd <= docid, "matrix columns must come in ascending order"
                if docid != previd:
                    if previd >= 0:
                        return document
                    previd = docid

                document.append((termid, val,)) # add another field to the current document
            return document
    #endclass MmReader

Apparently, the transposed parameter is never used in the latest version of gensim, since MmReader and MmWriter use the same format.

For more details, the developer explained it in https://groups.google.com/forum/?hl=en#!topic/gensim/xc7q_q3wcyq
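
To make the swap at lines 525-526 and 567-568 concrete, here is a minimal Python 3 sketch of how I read the quoted code. The helper `parse_mm` and the two sample strings are hypothetical and not part of gensim. The assumption is that `transposed=True` means the file already stores documents as rows, while `transposed=False` means the file stores terms as rows, so the reader swaps both the header dimensions and each entry's indices to end up with documents first.

```python
def parse_mm(text, transposed=True):
    """Hypothetical helper (not gensim's API).

    Return (num_docs, num_terms, entries), where entries is a list of
    0-based (doc_id, term_id, value) triples.
    """
    # Drop blank lines and '%' comment/header lines.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith('%')]
    rows, cols, _nnz = map(int, lines[0].split())
    # Same idea as the swap at "line 525": if the file is term-major,
    # its rows are terms and its columns are documents.
    num_docs, num_terms = (rows, cols) if transposed else (cols, rows)
    entries = []
    for ln in lines[1:]:
        first, second, val = ln.split()
        # Same idea as the swap at "line 567".
        doc_id, term_id = (first, second) if transposed else (second, first)
        # Matrix Market indices are 1-based; convert to 0-based.
        entries.append((int(doc_id) - 1, int(term_id) - 1, float(val)))
    return num_docs, num_terms, entries


# The same 2-document, 3-term matrix stored in two layouts.
DOCS_AS_ROWS = """%%MatrixMarket matrix coordinate real general
2 3 3
1 1 1.0
1 3 2.0
2 2 5.0
"""

TERMS_AS_ROWS = """%%MatrixMarket matrix coordinate real general
3 2 3
1 1 1.0
3 1 2.0
2 2 5.0
"""

a = parse_mm(DOCS_AS_ROWS, transposed=True)
b = parse_mm(TERMS_AS_ROWS, transposed=False)
print(a == b)
```

Under that assumption both calls describe the same corpus, which is why the swap sits in the `not self.transposed` branch. Note that the term-major sample is sorted by its second index (the document), because the quoted reader asserts that document ids arrive in ascending order.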

