
Overview
Key Features
- BERT-based Document Clustering: Built an unsupervised clustering pipeline that uses BERT contextual embeddings to group 1.2M+ multilingual material descriptions from across SLB's global operations.
- Dimensionality Reduction: Applied PCA and t-SNE to high-dimensional text embeddings to visualize document relationships and identify procurement patterns.
- Automated Classification: Applied K-Means clustering to extract the top 100 procurement-relevant categories, supporting supplier consolidation and cost-control strategies.
- Interactive Visualizations: Created comprehensive data visualizations including word clouds, similarity matrices, and cluster projections for stakeholder presentations.
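
The clustering stage described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the BERT embeddings have already been computed and stored as a NumPy array, and it uses synthetic vectors as a stand-in. The function name `cluster_embeddings`, its parameters, and the idea of ranking clusters by size to pick the top categories are assumptions made for the example.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_embeddings(embeddings, n_clusters, n_components=50, top_n=100, seed=0):
    """Reduce embeddings with PCA, cluster with K-Means, rank clusters by size.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # PCA cannot keep more components than min(n_samples, n_features).
    n_components = min(n_components, *embeddings.shape)
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)

    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(reduced)

    # Largest clusters first; these serve as candidate categories.
    top_clusters = [label for label, _ in Counter(labels.tolist()).most_common(top_n)]
    return labels, top_clusters


# Synthetic stand-in for BERT embeddings: three well-separated groups in 768-d.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 768))
sizes = [60, 30, 10]
embeddings = np.vstack([c + rng.normal(size=(n, 768)) for c, n in zip(centers, sizes)])

labels, top_clusters = cluster_embeddings(
    embeddings, n_clusters=3, n_components=10, top_n=2
)
```

On the real corpus, `embeddings` would hold one row per material description, and `n_clusters` and `top_n` would be tuned to the data; at 1.2M+ rows, a mini-batch variant such as scikit-learn's `MiniBatchKMeans` may be worth considering.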
Technologies Used
- Python: Core programming language with NumPy, Pandas, and Scikit-learn for data processing and analysis.
- BERT & Transformers: For generating contextual embeddings and semantic understanding of technical documents.
- Machine Learning: K-Means for clustering and pattern discovery; PCA and t-SNE for dimensionality reduction.
- Data Visualization: Matplotlib, Seaborn, and custom plotting libraries for word clouds, similarity matrices, and cluster projections.
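
As an illustration of how a similarity matrix between clusters might be computed before plotting, here is a small NumPy sketch. The helper names `cluster_centroids` and `cosine_similarity_matrix`, the toy data, and the choice of comparing cluster centroids by cosine similarity are assumptions for the example, not details taken from the project.

```python
import numpy as np


def cluster_centroids(embeddings, labels):
    """Average the embeddings of each cluster, ordered by label value."""
    return np.vstack([embeddings[labels == k].mean(axis=0) for k in np.unique(labels)])


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against zero vectors to avoid division by zero.
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Toy 2-d "embeddings" with hypothetical cluster labels.
embeddings = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])

similarity = cosine_similarity_matrix(cluster_centroids(embeddings, labels))
```

The resulting square matrix can be passed to a heatmap function such as `seaborn.heatmap(similarity)` to produce the kind of similarity plot listed above.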
Challenges and Learnings
Outcome
This project shows how BERT embeddings and unsupervised clustering can turn large-scale technical documentation in the oil and gas industry into actionable procurement insights, such as categories for supplier consolidation and cost control.