How I Built a Resume Screening & Placement Prediction Model

Author: Raj Aryan | Published: May 12, 2025

In today's competitive job market and academic landscape, leveraging technology to streamline and enhance decision-making processes is crucial. This article explores a fascinating project that tackles two significant challenges: efficiently screening resumes and predicting student placement success. Developed using Python and a suite of powerful machine learning libraries, this project offers innovative solutions for HR departments and educational institutions alike.

Part 1: Smart Resume Screening

The sheer volume of job applications can be overwhelming for recruiters. Manually sifting through hundreds, if not thousands, of resumes is time-consuming and prone to human bias. This project introduces an intelligent system to automate and optimize this process.

The Solution

The resume screening component of this project aims to classify resumes into predefined job categories. It also provides an Applicant Tracking System (ATS) score by comparing resumes against job descriptions.

Key Features & Working

Data Foundation: The system is trained on a dataset of resumes categorized by job roles ("UpdatedResumeDataSet.csv").
Data Cleaning: Resumes are meticulously cleaned to remove irrelevant information such as URLs, special characters, and extra whitespace, ensuring the model focuses on meaningful content.
Understanding the Data: Exploratory data analysis techniques, including visualizations like count plots and pie charts, are used to understand the distribution of resume categories.
Transforming Text into Features:
- Label Encoding: Categorical job titles are converted into numerical representations for machine learning processing.
- TF-IDF Vectorization: The textual content of the resumes is converted into numerical vectors using Term Frequency-Inverse Document Frequency (TF-IDF), which highlights the importance of different words in the resumes.
Building the Classifier: A K-Nearest Neighbors (KNN) algorithm is trained to classify resumes based on their textual features. The model's performance is rigorously evaluated using metrics like accuracy and a detailed classification report.
Handling PDF Resumes: The system incorporates Optical Character Recognition (OCR) capabilities using pytesseract and pdf2image to extract text from resumes submitted in PDF format, broadening its applicability.
ATS Scoring for Job Matching: To further aid recruiters, the project calculates an ATS score using cosine similarity. This score measures the relevance of a resume to a given job description by comparing their keyword content. Suggestions are also provided to candidates on how to improve their resume based on missing keywords relevant to the job description.

Part 2: Predicting Placement Success

For educational institutions, understanding the factors that contribute to student placement and predicting outcomes can be invaluable. This part of the project focuses on building a model to forecast student placement likelihood.

The Approach

By analyzing various student attributes, this system aims to predict whether a student will be placed or not.

Key Features & Working:

Data Source: The placement prediction model utilizes a dataset named "collegePlace.csv," containing student information such as age, gender, academic stream, internships, CGPA, hostel accommodation, and history of backlogs.
Data Preparation: The dataset undergoes preprocessing steps including the removal of duplicate entries and appropriate formatting of columns.
Gaining Insights: Visualizations help in understanding the relationships between different student attributes and placement outcomes. This includes analyzing distributions by gender, academic stream, and correlations between numerical features.
Feature Engineering for Prediction: Categorical data like 'Gender' and 'Stream' are converted into numerical formats using Label Encoding and One-Hot Encoding to make them suitable for machine learning algorithms.
Predictive Modeling: The project employs two popular classification algorithms:
- Logistic Regression: A statistical model used to predict binary outcomes (Placed/Not Placed).
- Support Vector Classifier (SVC): A powerful algorithm that finds an optimal hyperplane to separate data points into different classes.
Evaluating Model Performance: The accuracy of these models is assessed using metrics like accuracy scores and confusion matrices, providing insights into their predictive power. The primary features used for training in the documented example are 'CGPA' and 'Internships'.

Technologies Powering the Project

This project leverages a robust stack of Python libraries, demonstrating a comprehensive approach to data science and machine learning:

Data Manipulation & Analysis: Pandas, NumPy
Data Visualization: Matplotlib, Seaborn
Machine Learning: Scikit-learn (for TF-IDF, KNN, Logistic Regression, SVC, train-test split, evaluation metrics)
Natural Language Processing: Spacy (though NLTK is also often used, Spacy is explicitly mentioned for keyword extraction )
OCR: Pytesseract, pdf2image
File Handling & Serialization: os, zipfile, pickle

Conclusion

This dual-purpose project showcases the transformative potential of machine learning in automating and enhancing critical aspects of career development and recruitment. The resume screening system offers a streamlined approach to identifying suitable candidates, saving time and reducing bias. Simultaneously, the placement prediction model provides educational institutions with valuable insights into student employability, enabling them to offer targeted support and guidance.

The meticulous data preprocessing, thoughtful feature engineering, and application of appropriate machine learning algorithms underscore the project's robust design. While the documented accuracy scores are promising (e.g., KNN for resume screening achieving high accuracy, and Logistic Regression for placement prediction showing good results), there's always room for further refinement, such as exploring more advanced models or incorporating larger and more diverse datasets.

Overall, this project serves as an excellent example of how AI can be practically applied to solve real-world challenges in the HR and education sectors, paving the way for more efficient and data-driven decision-making.

Full Code: GitHub