machine learning - How to change the threshold for precision and recall in Python scikit-learn?
I have heard people say that you can adjust the threshold to tweak the trade-off between precision and recall, but I can't find an actual example of how to do that.

My code:
for i in mass[k]:
    df = df_temp  # reset df before each loop

    if 1 == 1:  ###if i == singleEthnic:
        count += 1
        ethnicity_tar = str(i)  # fr, en, ir, sc, others, ab, rus, ch, it, jp
        # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay

        ############################################
        ############################################
        # label each row 1 if it matches the target ethnicity, else 0
        def ethnicity_target(row):
            try:
                if row[ethnicity_var] == ethnicity_tar:
                    return 1
                else:
                    return 0
            except:
                return None

        df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
        print '1=', ethnicity_tar
        print '0=', 'non-' + ethnicity_tar

        # random sampling of a smaller dataframe for debugging
        rows = df.sample(n=subsample_size, random_state=seed)  # seed gives fixed randomness
        df = DataFrame(rows)
        print 'class count:'
        print df['ethnicity_scan'].value_counts()

        # assign X and y variables
        X = df.raw_name.values
        X2 = df.name.values
        X3 = df.gender.values
        X4 = df.location.values
        y = df.ethnicity_scan.values

        # feature extraction functions
        def feature_full_name(namestring):
            try:
                full_name = namestring
                if len(full_name) > 1:  # do not accept a name with only 1 character
                    return full_name
                else:
                    return '?'
            except:
                return '?'

        def feature_full_last_name(namestring):
            try:
                last_name = namestring.rsplit(None, 1)[-1]
                if len(last_name) > 1:  # do not accept a name with only 1 character
                    return last_name
                else:
                    return '?'
            except:
                return '?'

        def feature_full_first_name(namestring):
            try:
                first_name = namestring.rsplit(' ', 1)[0]
                if len(first_name) > 1:  # do not accept a name with only 1 character
                    return first_name
                else:
                    return '?'
            except:
                return '?'

        # transform the format of the X variables and spit out a numpy array of all features
        my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
        my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

        all_dict = []
        for i in range(0, len(my_dict)):
            temp_dict = dict(my_dict[i].items() + my_dict5[i].items())
            all_dict.append(temp_dict)

        newX = dv.fit_transform(all_dict)

        # separate the training and testing data sets
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

        # fitting X and y into the model, using the training data
        classifierUsed2.fit(X_train, y_train)

        # making predictions using the trained model
        y_train_predictions = classifierUsed2.predict(X_train)
        y_test_predictions = classifierUsed2.predict(X_test)
I tried replacing the line "y_test_predictions = classifierUsed2.predict(X_test)" with "y_test_predictions = classifierUsed2.predict(X_test) > 0.8" and with "y_test_predictions = classifierUsed2.predict(X_test) > 0.01", but nothing changes drastically.
classifierUsed2.predict(X_test) outputs the predicted class (mostly 0s and 1s) for each sample. What you want is classifierUsed2.predict_proba(X_test), which outputs a 2D array of probabilities for each class per sample. For thresholding you can then do something like:
y_test_probabilities = classifierUsed2.predict_proba(X_test)
# y_test_probabilities has shape [n_samples, n_classes]
y_test_predictions_high_precision = y_test_probabilities[:, 1] > 0.8
y_test_predictions_high_recall = y_test_probabilities[:, 1] > 0.1
y_test_predictions_high_precision will mostly contain samples that really belong to class 1, while y_test_predictions_high_recall will predict class 1 more often (and achieve a higher recall) but will also contain many false positives.
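To see the trade-off numerically, you can score both thresholded prediction arrays against y_test with sklearn.metrics. This is a sketch that assumes y_test and the two arrays from the snippet above are already in scope:

from sklearn.metrics import precision_score, recall_score

# cast the boolean masks to 0/1 so they compare cleanly with the 0/1 labels in y_test
high_precision = y_test_predictions_high_precision.astype(int)
high_recall = y_test_predictions_high_recall.astype(int)

print('threshold 0.8: precision=%.2f recall=%.2f'
      % (precision_score(y_test, high_precision), recall_score(y_test, high_precision)))
print('threshold 0.1: precision=%.2f recall=%.2f'
      % (precision_score(y_test, high_recall), recall_score(y_test, high_recall)))

Raising the threshold should push precision up and recall down; lowering it does the opposite.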
predict_proba is supported by both classifiers you use, logistic regression and SVM.
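One caveat worth knowing: scikit-learn's SVC only exposes predict_proba if the estimator was created with probability=True; otherwise you can threshold the output of decision_function instead. Below is a minimal, self-contained sketch on synthetic data (not the asker's dataframe) that shows how moving the threshold trades precision against recall, and how to sweep all candidate thresholds with precision_recall_curve:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, precision_recall_curve
from sklearn.model_selection import train_test_split  # newer replacement for the cross_validation module

# imbalanced binary data standing in for the 1 / non-1 ethnicity labels
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1 for each test sample

# precision and recall at a few hand-picked thresholds
for t in (0.1, 0.5, 0.8):
    pred = (proba > t).astype(int)
    print('threshold %.1f: precision=%.2f recall=%.2f'
          % (t, precision_score(y_test, pred), recall_score(y_test, pred)))

# or sweep every candidate threshold at once
precision, recall, thresholds = precision_recall_curve(y_test, proba)

precision_recall_curve returns the precision and recall at every threshold that changes a prediction, which is handy for picking the operating point you actually want.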