machine learning - How to change threshold for precision and recall in python scikit-learn?


I have heard people say that you can adjust the threshold to tweak the trade-off between precision and recall, but I can't find an actual example of how to do that.

My code:

    for i in mass[k]:
        df = df_temp  # reset df before each loop
        #$$
        #$$
        if 1 == 1:
        ### if i == singleethnic:
            count += 1
            ethnicity_tar = str(i)  # fr, en, ir, sc, others, ab, rus, ch, it, jp
            # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay

            def ethnicity_target(row):
                try:
                    if row[ethnicity_var] == ethnicity_tar:
                        return 1
                    else:
                        return 0
                except:
                    return None

            df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
            print '1=', ethnicity_tar
            print '0=', 'non-' + ethnicity_tar

            # random sampling of a smaller dataframe for debugging
            rows = df.sample(n=subsample_size, random_state=seed)  # seed gives fixed randomness
            df = DataFrame(rows)
            print 'class count:'
            print df['ethnicity_scan'].value_counts()

            # assign x and y variables
            x = df.raw_name.values
            x2 = df.name.values
            x3 = df.gender.values
            x4 = df.location.values
            y = df.ethnicity_scan.values

            # feature extraction functions
            def feature_full_name(namestring):
                try:
                    full_name = namestring
                    if len(full_name) > 1:  # do not accept a name of 1 character
                        return full_name
                    else:
                        return '?'
                except:
                    return '?'

            def feature_full_last_name(namestring):
                try:
                    last_name = namestring.rsplit(None, 1)[-1]
                    if len(last_name) > 1:  # do not accept a name of 1 character
                        return last_name
                    else:
                        return '?'
                except:
                    return '?'

            def feature_full_first_name(namestring):
                try:
                    first_name = namestring.rsplit(' ', 1)[0]
                    if len(first_name) > 1:  # do not accept a name of 1 character
                        return first_name
                    else:
                        return '?'
                except:
                    return '?'

            # transform the format of the x variables and spit out a numpy array of features
            my_dict = [{'last-name': feature_full_last_name(i)} for i in x]
            my_dict5 = [{'first-name': feature_full_first_name(i)} for i in x]

            all_dict = []
            for i in range(0, len(my_dict)):
                temp_dict = dict(
                    my_dict[i].items() + my_dict5[i].items()
                    )
                all_dict.append(temp_dict)

            newx = dv.fit_transform(all_dict)

            # separate training and testing data sets
            x_train, x_test, y_train, y_test = cross_validation.train_test_split(newx, y, test_size=testtrainsplit)

            # fitting x and y to the model, using training data
            classifierused2.fit(x_train, y_train)

            # making predictions using the trained data
            y_train_predictions = classifierused2.predict(x_train)
            y_test_predictions = classifierused2.predict(x_test)

I tried replacing the line "y_test_predictions = classifierused2.predict(x_test)" with "y_test_predictions = classifierused2.predict(x_test) > 0.8" and "y_test_predictions = classifierused2.predict(x_test) > 0.01", but nothing changes drastically.

classifierused2.predict(x_test) outputs the predicted class (mostly 0s and 1s) for each sample, so comparing it against a threshold does nothing useful. What you want is classifierused2.predict_proba(x_test), which outputs a 2D array of probabilities for each class per sample. For thresholding you can do something like:

    y_test_probabilities = classifierused2.predict_proba(x_test)
    # y_test_probabilities has shape = [n_samples, n_classes]

    y_test_predictions_high_precision = y_test_probabilities[:, 1] > 0.8
    y_test_predictions_high_recall = y_test_probabilities[:, 1] > 0.1

y_test_predictions_high_precision will contain only the samples the classifier is confident belong to class 1, while y_test_predictions_high_recall will predict class 1 much more often (and so achieve higher recall), but will also contain many false positives.

predict_proba is supported by both classifiers you use, logistic regression and SVM (note that scikit-learn's SVC only exposes predict_proba when constructed with probability=True).
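As a minimal, self-contained sketch of the idea (using a toy dataset and LogisticRegression as stand-ins for the name-feature pipeline above, and the current sklearn.model_selection module rather than the deprecated cross_validation one):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# toy, slightly imbalanced binary data standing in for the name features
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# predict_proba returns shape [n_samples, n_classes]; column 1 is P(class == 1)
y_test_probabilities = clf.predict_proba(X_test)

# strict threshold -> fewer, more confident positives (higher precision)
y_test_predictions_high_precision = y_test_probabilities[:, 1] > 0.8
# loose threshold -> many more positives (higher recall, more false positives)
y_test_predictions_high_recall = y_test_probabilities[:, 1] > 0.1

print('threshold 0.8: precision=%.2f recall=%.2f' % (
    precision_score(y_test, y_test_predictions_high_precision),
    recall_score(y_test, y_test_predictions_high_precision)))
print('threshold 0.1: precision=%.2f recall=%.2f' % (
    precision_score(y_test, y_test_predictions_high_recall),
    recall_score(y_test, y_test_predictions_high_recall)))
```

Lowering the threshold can only add positive predictions, so recall at 0.1 is guaranteed to be at least as high as at 0.8; precision usually moves the other way.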

