Active learning is a special case of machine learning in which a learning algorithm can interactively query a human user (or some other information source) to label new data points with the desired outputs. The human user must have knowledge or expertise in the problem domain, including the ability to consult authoritative sources when necessary.[1][2][3] In the statistics literature, it is sometimes also called optimal experimental design.[4] The information source is also called the teacher or oracle.

There are situations in which unlabeled data is abundant but manual labeling is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning. With this approach, there is a risk that the algorithm is overwhelmed by uninformative examples. Recent developments are dedicated to multi-label active learning,[5] hybrid active learning,[6] and active learning in a single-pass (on-line) context,[7] combining concepts from the field of machine learning (e.g. conflict and ignorance) with adaptive, incremental learning policies from the field of online machine learning. Active learning can also speed up the development of a machine learning algorithm in cases where comparable updates would otherwise require a quantum computer or a supercomputer.[8]

Large-scale active learning projects may benefit from crowdsourcing frameworks such as Amazon Mechanical Turk that include many humans in the active learning loop.


Let T be the total set of all data under consideration. For example, in a protein engineering problem, T would include all proteins that are known to have a certain interesting activity and all additional proteins that one might want to test for that activity.

During each iteration, i, T is broken up into three subsets:

  1. TK,i: Data points where the label is known.
  2. TU,i: Data points where the label is unknown.
  3. TC,i: A subset of TU,i that is chosen to be labeled.

Most of the current research in active learning involves the best method to choose the data points for TC,i.
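
The bookkeeping behind this iteration can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the names T_K, T_U and T_C (for the known, unknown and chosen subsets), the functions active_learning_loop and choose_near_boundary, and the one-dimensional threshold task with its cut-off of 0.62 are all hypothetical. The selection rule shown is one simple heuristic among many possible ones.

```python
def active_learning_loop(T, seed, oracle, choose, n_iterations):
    """Track the known (T_K), unknown (T_U) and chosen (T_C) subsets of T."""
    T_K = {x: oracle(x) for x in seed}        # points whose label is known
    T_U = [x for x in T if x not in T_K]      # points whose label is unknown
    history = []
    for i in range(n_iterations):
        if not T_U:                           # nothing left to label
            break
        T_C = choose(T_K, T_U)                # subset of T_U picked for labeling
        for x in T_C:
            T_K[x] = oracle(x)                # query the oracle (teacher)
        chosen = set(T_C)
        T_U = [x for x in T_U if x not in chosen]
        history.append((i, list(T_C)))
    return T_K, T_U, history


def choose_near_boundary(T_K, T_U, batch_size=1):
    """Hypothetical heuristic: pick the unlabeled points closest to the
    current estimate of the class boundary on a 1-D threshold task."""
    neg = [x for x, y in T_K.items() if y == 0]
    pos = [x for x, y in T_K.items() if y == 1]
    if neg and pos:
        boundary = (max(neg) + min(pos)) / 2
    else:
        boundary = sum(T_U) / len(T_U)        # fallback: centre of the pool
    return sorted(T_U, key=lambda x: abs(x - boundary))[:batch_size]


def oracle(x):
    """Stand-in for the human labeler; the threshold 0.62 is made up."""
    return int(x >= 0.62)


T = [i / 20 for i in range(21)]               # toy pool: 0.0, 0.05, ..., 1.0
T_K, T_U, history = active_learning_loop(
    T, seed=[0.0, 1.0], oracle=oracle, choose=choose_near_boundary, n_iterations=5
)
print(len(T_K), "labeled,", len(T_U), "still unlabeled")
```

Each pass moves the chosen points from the unknown subset to the known subset, so the two always partition T. Only the choose function changes from one query strategy to another.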


Query strategies

Algorithms for determining which data points should be labeled can be organized into a number of different categories, based upon their purpose:[1]

A wide variety of algorithms have been studied that fall into these categories.[1][4] While the traditional AL strategies can achieve remarkable performance, it is often challenging to predict in advance which strategy is the most suitable in a particular situation. In recent years, meta-learning algorithms have been gaining in popularity. Some of them have been proposed to tackle the problem of learning AL strategies instead of relying on manually designed strategies. A benchmark that compares "meta-learning approaches to active learning" with "traditional heuristic-based active learning" may indicate whether "learning active learning" is at the crossroads.[18]

Minimum marginal hyperplane

Some active learning algorithms are built upon support-vector machines (SVMs) and exploit the structure of the SVM to determine which data points to label. Such methods usually calculate the margin, W, of each unlabeled datum in TU,i and treat W as an n-dimensional distance from that datum to the separating hyperplane.

Minimum Marginal Hyperplane methods assume that the data with the smallest W are those that the SVM is most uncertain about and therefore should be placed in TC,i to be labeled. Other similar methods, such as Maximum Marginal Hyperplane, choose data with the largest W. Tradeoff methods choose a mix of the smallest and largest Ws.
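
The three selection rules can be illustrated with a short Python sketch. It assumes a linear SVM has already been trained and has produced a weight vector w and bias b, so that W is the unsigned distance |w·x + b| / ||w||. The function names margin and select_by_margin, the strategy labels, the even split used for the tradeoff rule, and the toy data are all assumptions made for this example. A kernel SVM would need its own decision function in place of the linear formula.

```python
import math


def margin(x, w, b):
    """Unsigned distance W from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_by_margin(T_U, w, b, k, strategy="min"):
    """Pick k unlabeled points (k <= len(T_U)) according to their margin W."""
    ranked = sorted(T_U, key=lambda x: margin(x, w, b))   # smallest W first
    n = len(ranked)
    if strategy == "min":          # most uncertain points
        return ranked[:k]
    if strategy == "max":          # points farthest from the hyperplane
        return ranked[n - k:]
    if strategy == "tradeoff":     # assumed split: half smallest, half largest
        n_small = (k + 1) // 2
        n_large = k - n_small
        return ranked[:n_small] + ranked[n - n_large:]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy hyperplane x1 - x2 = 0 and a small unlabeled pool (made-up values).
w, b = (1.0, -1.0), 0.0
T_U = [(0.0, 0.1), (2.0, 0.0), (1.0, 1.0), (0.0, 3.0), (0.5, 0.0)]

print(select_by_margin(T_U, w, b, k=2, strategy="min"))
print(select_by_margin(T_U, w, b, k=2, strategy="max"))
print(select_by_margin(T_U, w, b, k=2, strategy="tradeoff"))
```

In this toy pool the point (1.0, 1.0) lies exactly on the hyperplane, so it is ranked first under the minimum rule, while (0.0, 3.0) is the farthest point and is ranked last.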



  1. Settles, Burr (2010). "Active Learning Literature Survey" (PDF). Computer Sciences Technical Report 1648. University of Wisconsin–Madison. Retrieved 2014-11-18.
  2. Rubens, Neil; Elahi, Mehdi; Sugiyama, Masashi; Kaplan, Dain (2016). "Active Learning in Recommender Systems". In Ricci, Francesco; Rokach, Lior; Shapira, Bracha (eds.). Recommender Systems Handbook (PDF) (2 ed.). Springer US. doi:10.1007/978-1-4899-7637-6. hdl:11311/1006123. ISBN 978-1-4899-7637-6. S2CID 11569603.
  3. Das, Shubhomoy; Wong, Weng-Keen; Dietterich, Thomas; Fern, Alan; Emmott, Andrew (2016). "Incorporating Expert Feedback into Active Anomaly Discovery". In Bonchi, Francesco; Domingo-Ferrer, Josep; Baeza-Yates, Ricardo; Zhou, Zhi-Hua; Wu, Xindong (eds.). IEEE 16th International Conference on Data Mining. IEEE. pp. 853–858. doi:10.1109/ICDM.2016.0102. ISBN 978-1-5090-5473-2. S2CID 15285595.
  4. Olsson, Fredrik (April 2009). "A literature survey of active machine learning in the context of natural language processing". SICS Technical Report T2009:06.
  5. Yang, Bishan; Sun, Jian-Tao; Wang, Tengjiao; Chen, Zheng (2009). "Effective multi-label active learning for text classification" (PDF). Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09. p. 917. doi:10.1145/1557019.1557119. ISBN 978-1-60558-495-9. S2CID 1979173.
  6. Lughofer, Edwin (February 2012). "Hybrid active learning for reducing the annotation effort of operators in classification systems". Pattern Recognition. 45 (2): 884–896. Bibcode:2012PatRe..45..884L. doi:10.1016/j.patcog.2011.08.009.
  7. Lughofer, Edwin (2012). "Single-pass active learning with conflict and ignorance". Evolving Systems. 3 (4): 251–271. doi:10.1007/s12530-012-9060-7. S2CID 43844282.
  8. Novikov, Ivan (2021). "The MLIP package: moment tensor potentials with MPI and active learning". IOP Publishing. 2 (2): 3, 4. arXiv:2007.08555. doi:10.1088/2632-2153/abc9fe – via IOP science.
  9. DataRobot. "Active learning machine learning: What it is and how it works". DataRobot Blog. DataRobot Inc. Retrieved 30 January 2024.
  10. Wang, Liantao; Hu, Xuelei; Yuan, Bo; Lu, Jianfeng (2015-01-05). "Active learning via query synthesis and nearest neighbour search" (PDF). Neurocomputing. 147: 426–434. doi:10.1016/j.neucom.2014.06.042. S2CID 3027214.
  11. Bouneffouf, Djallel; Laroche, Romain; Urvoy, Tanguy; Féraud, Raphael; Allesiardo, Robin (2014). "Contextual Bandit for Active Learning: Active Thompson". In Loo, C. K.; Yap, K. S.; Wong, K. W.; Teoh, A.; Huang, K. (eds.). Neural Information Processing (PDF). Lecture Notes in Computer Science. Vol. 8834. pp. 405–412. doi:10.1007/978-3-319-12637-1_51. ISBN 978-3-319-12636-4. S2CID 1701357. HAL Id: hal-01069802.
  12. Bouneffouf, Djallel (8 January 2016). "Exponentiated Gradient Exploration for Active Learning". Computers. 5 (1): 1. arXiv:1408.2196. doi:10.3390/computers5010001. S2CID 14313852.
  13. Faria, Bruno; Perdigão, Dylan; Brás, Joana; Macedo, Luis (2022). "The Joint Role of Batch Size and Query Strategy in Active Learning-Based Prediction - A Case Study in the Heart Attack Domain". Lecture Notes in Computer Science. Vol. 13566. pp. 464–475. doi:10.1007/978-3-031-16474-3_38. ISBN 978-3-031-16473-6.
  14. "shubhomoydas/ad_examples". GitHub. Retrieved 2018-12-04.
  15. Makili, Lázaro Emílio; Sánchez, Jesús A. Vega; Dormido-Canto, Sebastián (2012-10-01). "Active Learning Using Conformal Predictors: Application to Image Classification". Fusion Science and Technology. 62 (2): 347–355. doi:10.13182/FST12-A14626. ISSN 1536-1055. S2CID 115384000.
  16. Zhao, Shuyang; Heittola, Toni; Virtanen, Tuomas (2020). "Active learning for sound event detection". IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv:2002.05033.
  17. Bernard, Jürgen; Zeppelzauer, Matthias; Lehmann, Markus; Müller, Martin; Sedlmair, Michael (June 2018). "Towards User-Centered Active Learning Algorithms". Computer Graphics Forum. 37 (3): 121–132. doi:10.1111/cgf.13406. ISSN 0167-7055. S2CID 51875861.
  18. Desreumaux, Louis; Lemaire, Vincent (2020). "Learning Active Learning at the Crossroads? Evaluation and Discussion". Proceedings of the Workshop on Interactive Adaptive Learning co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020), Ghent, Belgium, 2020. S2CID 221794570.