Transposed parameter in Matrix Market format of gensim
In the gensim library, there is an MmReader class that converts a Matrix Market format file into a Python object. It is sometimes necessary to transpose the matrix, hence the transposed parameter was introduced in MmReader.
However, I am confused about why, at lines 525-526 and 567-568 of https://github.com/piskvorky/gensim/blob/develop/gensim/matutils.py , the inversion of the term-document values and ids happens when transposed == False.
Would anyone familiar with term-document matrices in information retrieval care to enlighten me?
    class MmReader(object):
        """
        Wrap a term-document matrix on disk (in matrix-market format), and present it
        as an object which supports iteration over the rows (~documents).

        Note that the file is read into memory one document at a time, not the whole
        matrix at once (unlike scipy.io.mmread). This allows us to process corpora
        which are larger than the available RAM.
        """
        def __init__(self, input, transposed=True):
            """
            Initialize the matrix reader.

            The `input` refers to a file on local filesystem, which is expected to
            be in the sparse (coordinate) Matrix Market format. Documents are assumed
            to be rows of the matrix (and document features are columns).

            `input` is either a string (file path) or a file-like object that supports
            `seek()` (e.g. gzip.GzipFile, bz2.BZ2File).
            """
            logger.info("initializing corpus reader from %s" % input)
            self.input, self.transposed = input, transposed
            if isinstance(input, basestring):
                input = open(input)
            header = input.next().strip()
            if not header.lower().startswith('%%matrixmarket matrix coordinate real general'):
                raise ValueError("File %s not in Matrix Market format with coordinate real general; instead found: \n%s" %
                                 (self.input, header))
            self.num_docs = self.num_terms = self.num_nnz = 0
            for lineno, line in enumerate(input):
                if not line.startswith('%'):
                    self.num_docs, self.num_terms, self.num_nnz = map(int, line.split())
                    if not self.transposed:  ## line 525
                        self.num_docs, self.num_terms = self.num_terms, self.num_docs
                    break
            logger.info("accepted corpus with %i documents, %i features, %i non-zero entries" %
                        (self.num_docs, self.num_terms, self.num_nnz))

        def __len__(self):
            return self.num_docs

        def __str__(self):
            return ("MmCorpus(%i documents, %i features, %i non-zero entries)" %
                    (self.num_docs, self.num_terms, self.num_nnz))

        def skip_headers(self, input_file):
            """
            Skip file headers that appear before the first document.
            """
            for line in input_file:
                if line.startswith('%'):
                    continue
                break

        def __iter__(self):
            """
            Iteratively yield vectors from the underlying file, in the format (row_no, vector),
            where vector is a list of (col_no, value) 2-tuples.

            Note that the total number of vectors returned is always equal to the
            number of rows specified in the header; empty documents are inserted and
            yielded where appropriate, even if they are not explicitly stored in the
            Matrix Market file.
            """
            if isinstance(self.input, basestring):
                fin = open(self.input)
            else:
                fin = self.input
                fin.seek(0)
            self.skip_headers(fin)

            previd = -1
            for line in fin:
                docid, termid, val = line.split()
                if not self.transposed:
                    termid, docid = docid, termid
                # -1 because matrix market indexes are 1-based => convert to 0-based
                docid, termid, val = int(docid) - 1, int(termid) - 1, float(val)
                assert previd <= docid, "matrix columns must come in ascending order"
                if docid != previd:
                    # change of document: return the document read so far (its id is previd)
                    if previd >= 0:
                        yield previd, document

                    # return implicit (empty) documents between previous id and new id
                    # too, to keep consistent document numbering and corpus length
                    for previd in xrange(previd + 1, docid):
                        yield previd, []

                    # from now on start adding fields to a new document, with a new id
                    previd = docid
                    document = []

                document.append((termid, val,))  # add another field to the current document

            # handle the last document, as a special case
            if previd >= 0:
                yield previd, document

            # return empty documents between the last explicit document and the number
            # of documents as specified in the header
            for previd in xrange(previd + 1, self.num_docs):
                yield previd, []

        def docbyoffset(self, offset):
            """Return document at file offset `offset` (in bytes)"""
            # empty documents are not stored explicitly in MM format, so the index marks
            # them with a special offset, -1.
            if offset == -1:
                return []
            if isinstance(self.input, basestring):
                fin = open(self.input)
            else:
                fin = self.input

            fin.seek(offset)  # works for gzip/bz2 input, too
            previd, document = -1, []
            for line in fin:
                docid, termid, val = line.split()
                if not self.transposed:  ## line 567
                    termid, docid = docid, termid
                # -1 because matrix market indexes are 1-based => convert to 0-based
                docid, termid, val = int(docid) - 1, int(termid) - 1, float(val)
                assert previd <= docid, "matrix columns must come in ascending order"
                if docid != previd:
                    if previd >= 0:
                        return document
                    previd = docid

                document.append((termid, val,))  # add another field to the current document
            return document
    #endclass MmReader
Apparently, the transposed parameter is never used in the latest version of gensim. The format read by MmReader and the format written by MmWriter are the same.
For more details, the developer explained this in https://groups.google.com/forum/?hl=en#!topic/gensim/xc7q_q3wcyq
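To make the effect of the flag concrete, here is a minimal sketch of the swap performed in the quoted code. It is not gensim's implementation: read_documents and SAMPLE are hypothetical names made up for illustration, the code is written for Python 3, and it loads all entries into memory instead of streaming one document at a time. It follows the quoted logic in one respect only: with transposed=True each entry is read as "doc_id term_id value" (rows are documents), and with transposed=False the two ids and the two header dimensions are swapped (rows are terms).

```python
from collections import defaultdict


def read_documents(lines, transposed=True):
    """Group Matrix Market coordinate entries into per-document vectors.

    Hypothetical helper for illustration only, not part of gensim.
    transposed=True:  entries are "doc_id term_id value" (rows are documents).
    transposed=False: entries are "term_id doc_id value" (rows are terms).
    """
    docs = defaultdict(list)
    num_docs = 0
    seen_dims = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith('%'):
            continue  # skip the banner and comment lines
        first, second, third = line.split()
        if not seen_dims:
            # first non-comment line holds: rows, columns, non-zero count
            num_rows, num_cols = int(first), int(second)
            num_docs = num_rows if transposed else num_cols
            seen_dims = True
            continue
        # Matrix Market indexes are 1-based, so convert to 0-based
        doc_id, term_id = int(first) - 1, int(second) - 1
        if not transposed:
            # same swap as in the quoted code: the file's rows are terms
            doc_id, term_id = term_id, doc_id
        docs[doc_id].append((term_id, float(third)))
    # documents with no stored entries come back as empty lists
    return [docs.get(i, []) for i in range(num_docs)]


# A 2 x 3 matrix with three non-zero entries (made-up sample data).
SAMPLE = """%%MatrixMarket matrix coordinate real general
2 3 3
1 1 1.0
1 3 2.0
2 2 3.0
""".splitlines()

docs_as_rows = read_documents(SAMPLE, transposed=True)
docs_as_cols = read_documents(SAMPLE, transposed=False)
print(docs_as_rows)  # 2 documents over 3 terms
print(docs_as_cols)  # 3 documents over 2 terms
```

The same file yields two documents when its rows are treated as documents and three documents when its columns are. Note that the streaming reader in the question additionally asserts that document ids arrive in ascending order, which this in-memory sketch does not require.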