Lead / Staff Data Engineer - Data Platform
Apna
Software Engineering, Data Science
Bengaluru, Karnataka, India
Company: Apna
Team: Data Platform / Engineering
Location: Bangalore
Experience: 5-7 years
Why Join Apna
At Apna, data is central to how we build products, understand users, improve employer outcomes, power recommendations, and scale decision-making. This role gives you the opportunity to build the backbone of Apna’s data platform and influence how data is used across the company.
You will work on real-world, high-scale problems across jobs, users, employers, communities, matching, growth, and AI-driven systems.
About the Role
Apna is looking for a Lead / Staff Data Engineer to build and scale our core data platform. This role will work on large-scale data pipelines, lakehouse architecture, query platforms, workflow orchestration, and data reliability systems that power analytics, product intelligence, machine learning, business dashboards, experimentation, and operational decision-making across Apna.
We are looking for someone who can think deeply about data architecture, design reliable pipelines, improve data quality, and help build a platform that can scale with Apna’s growth.
What You’ll Own
You will be responsible for designing, building, and operating critical parts of Apna’s data platform, including:
- Building scalable batch and near-real-time data pipelines across product, business, growth, and ML use cases.
- Designing and improving our lakehouse architecture using technologies like Apache Hudi.
- Working with query engines such as Presto / Trino for large-scale analytical workloads.
- Building and maintaining orchestration workflows using Apache Airflow.
- Creating reusable data models, curated datasets, and reliable data marts for analytics and product teams.
- Improving data platform reliability, observability, SLA tracking, lineage, and data quality checks.
- Optimizing storage, compute, query performance, and pipeline costs.
- Partnering with product, analytics, ML, and backend engineering teams to understand data needs and convert them into scalable platform solutions.
- Driving engineering standards around data modeling, schema evolution, partitioning, deduplication, backfills, replayability, and pipeline ownership.
- Mentoring data engineers and influencing architecture decisions across teams.
What We’re Looking For
Must Have
- Strong experience in data engineering, preferably at scale.
- Hands-on experience with Apache Airflow or similar orchestration systems.
- Strong knowledge of Presto / Trino or other distributed query engines.
- Good understanding of Apache Hudi concepts such as:
  - Copy-on-write vs merge-on-read
  - Upserts and deletes
  - Incremental reads
  - Compaction
  - Clustering
  - Timeline and commits
  - Schema evolution
  - Partitioning strategy
- Strong knowledge of distributed data processing and storage systems.
- Ability to design and build reliable ETL / ELT pipelines.
- Strong SQL skills and ability to debug complex data issues.
- Good understanding of different data architectures, including:
  - Data warehouse
  - Data lake
  - Lakehouse
  - Lambda architecture
  - Kappa architecture
  - Medallion architecture
  - Event-driven data architecture
- Experience with data modeling for analytics and reporting.
- Strong programming skills in at least one language such as Python, Java, or Scala.
- Ability to reason about trade-offs between freshness, cost, reliability, latency, and complexity.
- Strong debugging and production ownership mindset.
Good to Have
- Experience with Kafka, Spark, Flink, Hive, Iceberg, Delta Lake, or BigQuery.
- Experience building internal data platforms or self-serve data infrastructure.
- Experience with data quality frameworks such as Great Expectations, Deequ, Soda, or custom validation systems.
- Exposure to ML feature pipelines or feature stores.
- Experience with metadata management, data catalogs, lineage, and governance.
- Experience with cloud infrastructure such as AWS, GCP, or Azure.
- Understanding of privacy, compliance, PII handling, and access control in data systems.
What Success Looks Like
In this role, success means:
- Critical business and product datasets are reliable, discoverable, and trusted.
- Pipelines are observable, recoverable, and have clear SLAs.
- Query performance improves across major analytical workloads.
- Data freshness and quality issues decrease significantly.
- Teams can build on top of the data platform faster without reinventing pipelines.
- The platform can scale with Apna’s user, job, employer, and engagement data.