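@GitHub_Daily: A carefully curated data engineer interview question bank on GitHub: data-engineering-interview-questions, with more than 2000 questions. It covers core areas such as databases and data warehouses, big data processing frameworks, cloud platform services, data formats, and data visualization,…
Summary
GitHub hosts a carefully curated data engineer interview question bank, data-engineering-interview-questions, with more than 2000 questions covering core areas such as databases, big data frameworks, cloud platforms, and data visualization.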
Cached: 2026/06/10 09:48
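A carefully curated data engineer interview question bank on GitHub: data-engineering-interview-questions, with more than 2000 questions.
It covers core areas such as databases and data warehouses, big data processing frameworks, cloud platform services, data formats, and data visualization. You can work through it topic by topic or use the full question list for a complete mock run.
GitHub:http://github.com/OBenner/data-engineering-interview-questions…
The database section covers mainstream choices such as Cassandra, MongoDB, HBase, Hive, Redshift, and BigQuery.
Theory is covered too: data modeling, data quality, system design, SQL, and Python each have a dedicated question set.
Each topic also links to the official documentation and an Awesome resource list for deeper study.
If you are preparing for a data engineering interview, this question bank is worth bookmarking; working through it systematically should leave you feeling much more prepared.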
OBenner/data-engineering-interview-questions
Source: https://github.com/OBenner/data-engineering-interview-questions
More than 2000 questions for preparing for a Data Engineer interview.
Full list of questions
Pick a topic below or use the full list to practice end-to-end.
Interview questions for Data Engineer
| Topic | Description | Useful links |
|---|---|---|
| Databases and Data Warehouses | | |
| Apache Cassandra | Cassandra is a distributed, wide-column store, NoSQL database management system. | Awesome Cassandra |
| Greenplum | Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology. | Awesome Greenplum |
| MongoDB | MongoDB is a document-oriented database. | Awesome MongoDB |
| Apache HBase | HBase is an open-source non-relational distributed database. | Awesome HBase |
| Apache Hive | Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. | Awesome Hive |
| Amazon DynamoDB | Amazon DynamoDB is a fully managed proprietary NoSQL database service. | Awesome DynamoDB · Awesome AWS |
| Amazon Redshift | Amazon Redshift is a data warehouse product. | Amazon Redshift Utilities · Awesome AWS |
| BigQuery GCP | BigQuery is a fully managed, serverless data warehouse. | Awesome BigQuery |
| Bigtable GCP | Bigtable is a fully managed wide-column and key-value NoSQL database service. | Awesome Bigtable |
| Data Formats | | |
| Apache Avro | Avro is a row-oriented remote procedure call and data serialization framework. | Awesome Avro |
| Apache Parquet | Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. | Parquet format · Docs |
| Delta | Delta Lake is a storage framework that enables building a Lakehouse architecture with compute engines. | Delta examples |
| Apache Iceberg | Apache Iceberg is an open table format for huge analytic datasets. | Iceberg docs |
| Apache Hudi | Apache Hudi brings upserts, deletes, and incremental processing to data lakes. | Hudi docs |
| Big Data Frameworks | | |
| Apache Airflow | Apache Airflow is a workflow management platform for data engineering pipelines. | Awesome Airflow |
| Apache Flume | Apache Flume is distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. | Flume User Guide |
| Apache Hadoop | Apache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. | Awesome Hadoop |
| Apache Impala | Apache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. | Impala docs |
| Apache Kafka | Apache Kafka is a distributed event store and stream-processing platform. | Awesome Kafka |
| Apache NiFi | Apache NiFi is a software project designed to automate the flow of data between software systems. | Awesome NiFi |
| Apache Spark | Apache Spark is a unified analytics engine for large-scale data processing. | Awesome Spark |
| Apache Flink | Apache Flink is a unified stream-processing and batch-processing framework. | Awesome Flink |
| Kubernetes | Kubernetes is a system for managing containerized applications across multiple hosts. | Awesome Kubernetes |
| Cloud providers | | |
| Amazon Web Services | Amazon Web Services is an online platform that provides scalable and cost-effective cloud computing solutions. | Awesome AWS |
| Microsoft Azure | Microsoft Azure is Microsoft's public cloud computing platform. | Awesome Azure |
| Google Cloud Platform | Google Cloud Platform is a suite of cloud computing services. | Awesome GCP |
| Modern Data Stack | | |
| dbt | dbt is a transformation framework for building tested and documented SQL models. | dbt tests |
| Theory | | |
| DWH Architectures | A data warehouse architecture defines the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. | Awesome databases |
| Change Data Capture (CDC) | CDC captures inserts/updates/deletes from source systems for low-latency ingestion. | Debezium docs |
| Data Modeling | Dimensional modeling concepts used to build reliable analytics datasets. | Kimball Group |
| Data Quality | Tests, monitoring, and practices to ensure datasets are trusted and correct. | Great Expectations docs |
| Data Observability | Monitoring and incident response practices for pipeline and dataset health. | OpenLineage |
| Data Governance | Ownership, policies, privacy, and access controls for data platforms. | DataHub |
| Cost Optimization | Practical techniques to reduce compute and storage costs while meeting SLAs. | Spark tuning |
| Python for Data Engineering | Python fundamentals for reliable, scalable data pipelines and tooling. | PyArrow docs |
| Data System Design | System design interview questions for batch/streaming data platforms. | Data mesh overview |
| Data Structures | A data structure is a specialized format for organizing, processing, retrieving and storing data. | Awesome Algorithms |
| SQL | SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). | Awesome SQL |
| Data visualization tools/BI | | |
| Tableau | Tableau is a powerful data visualization tool used in business intelligence. | Tableau Desktop docs |
| Looker | Looker is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time. | Looker docs |
| Apache Superset | Superset is a modern data exploration and data visualization platform. | Superset docs |
Contribution
Please contribute to this repository to help make it better. Any change, such as a new question, code improvement, or doc improvement, is very welcome.
See CONTRIBUTING.md for quick checks and guidelines.