We are seeking a Data Engineer / Python Developer to lead the data acquisition and processing efforts for a high-stakes, agentic AI chatbot in the healthcare domain. This is not a traditional BI or ETL role; you will not be building dashboards or moving data for analytics. Instead, you will architect a robust, modular engine capable of crawling, parsing, and normalizing vast amounts of unstructured and structured healthcare data from diverse sources—ranging from dynamic JavaScript websites and PDFs to proprietary vendor formats.
Location: Toronto, ON (1 day/week onsite)
Key Responsibilities
Data Collection & Web Crawling
- Advanced Web Scraping: Build and maintain scalable scrapers for static HTML and dynamic, JavaScript-heavy websites using Scrapy and BeautifulSoup (a scraping sketch follows this list).
- Multi-Format Ingestion: Develop custom parsers to ingest and normalize data from XML, RSS feeds, JSON, PDFs, database dumps, and non-standard vendor formats (see the feed-parsing sketch below).
- Source Management: Manage a large catalog of external public and proprietary data sources, ensuring raw data is persisted reliably.
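For illustration, a minimal sketch of the static-HTML side of this work using requests and BeautifulSoup; the URL and CSS selector are placeholders:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/clinics", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for row in soup.select("table.clinic-list tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)
# JavaScript-heavy pages would instead be rendered first with
# Playwright or Selenium, then parsed the same way.
```

And a sketch of feed ingestion with the standard library, assuming an RSS 2.0-style feed; the file path is a placeholder:

```python
import xml.etree.ElementTree as ET

tree = ET.parse("feed.xml")
items = [
    {
        "title": item.findtext("title", default=""),
        "link": item.findtext("link", default=""),
        "published": item.findtext("pubDate", default=""),
    }
    for item in tree.iter("item")
]
print(items[:3])
```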
Pipeline Architecture & Normalization
- Modular Engineering: Design and implement modular, reusable Python components to transform raw, heterogeneous data into standardized intermediate formats (e.g., JSON Lines, Parquet).
- Orchestration: Build and manage automated pipelines using Apache Airflow that can re-run processes, detect changes at the source, and perform incremental updates (a DAG sketch follows this list).
- AI Integration Support: Collaborate with AI engineers to implement data chunking, vectorization logic, and ingestion into vector databases (see the chunking sketch below).
- Security & Compliance: Implement rigorous data handling protocols to manage PII (Personally Identifiable Information) and PHI (Protected Health Information) within a secure healthcare environment.
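For illustration, a minimal sketch of an incremental-refresh pipeline in the Airflow TaskFlow style (assuming Airflow 2.4+); the source names and change-detection logic are placeholders, not part of the actual stack:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def source_refresh():
    @task
    def detect_changed_sources() -> list[str]:
        # Hypothetical change detection: compare current source
        # checksums against the last persisted snapshot.
        return ["source_a", "source_b"]

    @task
    def ingest(source_id: str) -> None:
        # Re-run ingestion only for sources that actually changed.
        print(f"ingesting {source_id}")

    # Dynamic task mapping fans out one ingest task per changed source.
    ingest.expand(source_id=detect_changed_sources())


source_refresh()
```

And a minimal sketch of the fixed-window text chunking that typically precedes vectorization; the size and overlap values are illustrative defaults, not requirements of the role:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]


# Each chunk would then be embedded and upserted into a vector database.
chunks = chunk_text("sample document text " * 200)
```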
Core Skills
- Expert Python: Deep experience in backend Python development with a focus on data processing.
- Web Scraping Stack: Mastery of Scrapy, BeautifulSoup, and tools for handling dynamic content (e.g., Selenium, Playwright, or headless browsers).
- Orchestration: Professional experience building and monitoring pipelines in Apache Airflow.
- Data Formatting: Proficiency in handling diverse serialization formats (JSONL, Parquet, XML) and unstructured data (PDF parsing); a serialization sketch follows this list.
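For illustration, a minimal sketch of writing the same normalized records to JSON Lines and Parquet; the field names are invented for the example:

```python
import json

import pandas as pd

records = [
    {"source": "rss", "title": "Flu clinic hours", "body": "..."},
    {"source": "pdf", "title": "Formulary update", "body": "..."},
]

# JSON Lines: one record per line, easy to stream and append.
with open("normalized.jsonl", "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Parquet: columnar storage for batch processing downstream
# (requires pyarrow or fastparquet to be installed).
pd.DataFrame(records).to_parquet("normalized.parquet", index=False)
```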
Experience & Qualifications
- Healthcare Domain: Prior experience working with sensitive data, including PII/PHI, and adhering to security compliance standards (e.g., HIPAA).
- Cloud Platforms: Strong preference for candidates with hands-on experience in GCP (BigQuery, Cloud Functions, GCS).
- Collaborative Mindset: Proven ability to work in a team-oriented environment, collaborating closely with AI and backend engineers.
- Best Practices: Strong grasp of software engineering principles (DRY, SOLID) and data engineering patterns.