You have probably come across a few posts on Azure Data Factory interview questions and answers, but few of them cover real-time, scenario-based questions. In this post, I will take you through scenario-based Azure Data Factory and Azure Databricks interview questions and answers for experienced Azure Data Factory developers.
- Question 1: Assume that you are a data engineer for company ABC. The company wants to migrate from its on-premises environment to the Microsoft Azure cloud, and you will probably use Azure Data Factory for this purpose. You have created a pipeline that copies the data of one table from on-premises to the Azure cloud. What steps do you need to take to ensure this pipeline executes successfully?
- Question 2: Assume that you are working for company ABC as a data engineer. You have successfully created the pipeline needed for the migration, and it works fine in your development environment. How would you deploy this pipeline to production with no changes, or only minimal ones?
- Question 3: Assume that you have around 1 TB of data stored in Azure Blob Storage, spread across multiple CSV files. You are asked to apply a couple of transformations to this data, as per business logic and needs, before moving it into the staging container. How would you plan and architect the solution for this scenario? Explain in detail.
- Question 4: Assume that you have an IoT device installed in your vehicle. The device sends data every hour, and the data is stored in a blob storage location in Microsoft Azure. You have to move this data from that storage location into a SQL database. How would you design the solution? Explain your reasoning.
- Final Thoughts
Question 1: Assume that you are a data engineer for company ABC. The company wants to migrate from its on-premises environment to the Microsoft Azure cloud, and you will probably use Azure Data Factory for this purpose. You have created a pipeline that copies the data of one table from on-premises to the Azure cloud. What steps do you need to take to ensure this pipeline executes successfully?
Moving from a traditional on-premises database to the cloud is a sound decision. Because we have to move data from an on-premises location to the cloud, we need an integration runtime that can reach the source. The auto-resolve integration runtime provided by Azure Data Factory cannot connect to your on-premises network, so step 1 is to create our own self-hosted integration runtime. This can be done in two ways:
The first way is to have a virtual machine ready in the cloud and install the integration runtime on it. This VM must have network connectivity to the on-premises data source.
The second way is to take a machine on the on-premises network and install the integration runtime there.
Once we have decided on the machine where the integration runtime will be installed (let's take the virtual machine approach), follow these steps to install it:
- Go to the Azure Data Factory portal. In the Manage tab, select Integration runtimes.
- Create a self-hosted integration runtime by providing general information such as a name and description.
- Create an Azure VM (if you already have one, you can skip this step).
- Download the integration runtime software onto the Azure virtual machine and install it.
- Copy the autogenerated key from the self-hosted integration runtime you created in the portal and paste it into the newly installed integration runtime on the Azure VM.
For a detailed step-by-step guide to installing a self-hosted integration runtime, see: How to Install Self-Hosted Integration Runtime on Azure vm – AzureLib
Once the integration runtime is ready, we move on to linked service creation. Create the linked service that connects to your data source, and configure it to use the integration runtime created above.
After this we create the pipeline. It will have a copy activity whose source is the database at the on-premises location and whose sink is the database in the cloud.
Once all of this is done, we execute the pipeline. As per the problem statement, this is a one-time load, and it will move the data from the table in the on-premises database to the cloud database.
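To make the link between the linked service and the self-hosted integration runtime concrete, here is a minimal sketch that builds a linked service definition in Python and prints it as JSON. The names `SelfHostedIR-OnPrem` and `OnPremSqlServerLS` and the placeholder connection string are assumptions made up for illustration; only the overall shape (a `connectVia` reference of type `IntegrationRuntimeReference`) follows the documented linked service format.

```python
import json

# Hypothetical integration runtime name, used only for illustration.
SELF_HOSTED_IR = "SelfHostedIR-OnPrem"


def on_prem_sql_linked_service(name: str, connection_string: str, ir_name: str) -> dict:
    """Build a SQL Server linked service that connects through a self-hosted IR."""
    return {
        "name": name,
        "properties": {
            "type": "SqlServer",
            "typeProperties": {"connectionString": connection_string},
            # connectVia tells Data Factory which integration runtime to use.
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


linked_service = on_prem_sql_linked_service(
    name="OnPremSqlServerLS",  # hypothetical linked service name
    connection_string="<your on-premises connection string>",  # placeholder
    ir_name=SELF_HOSTED_IR,
)
print(json.dumps(linked_service, indent=2))
```

Without the `connectVia` block, the linked service would fall back to the auto-resolve integration runtime, which is the one that cannot reach the on-premises network.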
Question 2: Assume that you are working for company ABC as a data engineer. You have successfully created the pipeline needed for the migration, and it works fine in your development environment. How would you deploy this pipeline to production with no changes, or only minimal ones?
When you create a pipeline for migration, or for any other purpose such as ETL, it will almost always use a data source. In the scenario above we are doing a migration, so the pipeline uses a data source on the source side and another on the destination side, and moves data between them. The question also states that the data engineer has developed the pipeline successfully in the development environment, so it is safe to assume that both the source and destination data sources point to the development environment. The pipeline would have a copy activity that uses datasets, which in turn use linked services, for the source and the sink.
A linked service provides the way to connect to a data source by holding its details, such as the server address, port number, username, password, key, or other credential-related information.
In this case, our linked services are probably pointing to the development environment.
Before the production deployment, we may also need to do a couple of other deployments, such as to a testing or UAT environment.
Hence we need to design our Azure Data Factory pipeline components so that environment-related information is supplied dynamically, as parameters. None of this information should be hard-coded.
We need to create an ARM template for our pipeline. The ARM template must contain a definition for every constituent of the pipeline: linked services, datasets, activities, and the pipeline itself.
Once the ARM template is ready, it should be checked into the Git repository. A lead or admin will create the DevOps pipeline, which takes this ARM template and a parameter file as input. The DevOps pipeline deploys the ARM template and creates all the resources, such as the linked services, datasets, activities, and your data pipeline, in the production environment.
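The idea of keeping one ARM template and swapping only the parameter file per environment can be sketched as follows. This is an illustration under assumptions, not the exact files your factory will produce: the environment names, the factory names, and the `sqlServerHost` parameter are hypothetical, and a real template defines its own parameter names.

```python
import json

# Hypothetical per-environment settings; real values come from your own setup.
ENVIRONMENTS = {
    "dev": {
        "factoryName": "adf-abc-dev",
        "sqlServerHost": "sql-abc-dev.database.windows.net",
    },
    "prod": {
        "factoryName": "adf-abc-prod",
        "sqlServerHost": "sql-abc-prod.database.windows.net",
    },
}


def build_parameter_file(environment: str) -> dict:
    """Return an ARM deployment parameter file for the given environment."""
    settings = ENVIRONMENTS[environment]
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        # ARM expects each parameter wrapped as {"value": ...}.
        "parameters": {key: {"value": value} for key, value in settings.items()},
    }


prod_parameters = build_parameter_file("prod")
print(json.dumps(prod_parameters, indent=2))
```

The template itself stays identical across environments. Secrets such as passwords are better referenced from a secret store than written into a parameter file that lives in Git.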
Question 3: Assume that you have around 1 TB of data stored in Azure Blob Storage, spread across multiple CSV files. You are asked to apply a couple of transformations to this data, as per business logic and needs, before moving it into the staging container. How would you plan and architect the solution for this scenario? Explain in detail.
First of all, we need to analyze the situation. If you look closely at the size of the data, you will see that it is very large. Transforming this much data directly on a single machine could be a cumbersome and time-consuming process, so we should think about a big data processing mechanism that lets us leverage parallel and distributed computing. Here we have two choices:
- We can use Hadoop MapReduce through HDInsight to do the transformation.
- We can use Spark through Azure Databricks to do the transformation on data at this scale.
Of these two, Spark on Azure Databricks is the better choice: Spark is typically much faster than Hadoop MapReduce because it keeps intermediate data in memory instead of writing it to disk between steps. So let's choose Azure Databricks.
Next we need to create the pipeline in Azure Data Factory. The pipeline should use a Databricks Notebook activity.
We can write all the business-related transformation logic in the Spark notebook. Databricks notebooks can be written in Python, Scala, SQL, or R.
When you execute the pipeline, it triggers the Azure Databricks notebook, and your transformation logic runs as you defined it in the notebook. In the notebook itself, you can write the logic to store the output in the blob storage staging area.
That’s how you can solve the problem statement.
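As a rough sketch of what the pipeline described above could look like, the snippet below assembles a pipeline definition with a single Databricks Notebook activity. The pipeline name, the linked service name `AzureDatabricksLS`, the notebook path, and the two parameters are all hypothetical; only the activity type `DatabricksNotebook` and the `notebookPath` / `baseParameters` properties come from the documented activity format.

```python
import json


def databricks_notebook_activity(name: str, linked_service: str,
                                 notebook_path: str, parameters: dict) -> dict:
    """Build a Databricks Notebook activity definition for a pipeline."""
    return {
        "name": name,
        "type": "DatabricksNotebook",
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            # baseParameters are passed to the notebook as key/value pairs.
            "baseParameters": parameters,
        },
    }


pipeline = {
    "name": "TransformCsvToStaging",  # hypothetical pipeline name
    "properties": {
        "activities": [
            databricks_notebook_activity(
                name="RunTransformNotebook",
                linked_service="AzureDatabricksLS",   # hypothetical linked service
                notebook_path="/Shared/transform_csv",  # hypothetical notebook path
                parameters={"input_container": "raw", "output_container": "staging"},
            )
        ]
    },
}
print(json.dumps(pipeline, indent=2))
```

Passing the container names as parameters keeps the notebook itself free of hard-coded locations, so the same notebook can be reused across environments.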
Question 4: Assume that you have an IoT device installed in your vehicle. The device sends data every hour, and the data is stored in a blob storage location in Microsoft Azure. You have to move this data from that storage location into a SQL database. How would you design the solution? Explain your reasoning.
This looks like a typical incremental load scenario. As described in the problem statement, the IoT device writes data to the storage location every hour. Most likely the device sends JSON data to the cloud storage (as most IoT devices generate data in JSON format), and it probably writes a new JSON file each time data is sent to the cloud.
Hence we will have a number of files in the storage location, generated on an hourly basis, and we need to pull these files into the Azure SQL database.
We need to create a pipeline in Azure Data Factory that does the incremental load. We can use the conventional high watermark file mechanism to solve this problem.
The high watermark design is as follows:
- Create a file named, say, HighWaterMark.txt and store it somewhere in Azure Blob Storage. In this file we put the start date and time.
- Now create the pipeline in Azure Data Factory. Its first activity is a Lookup activity that reads the date from HighWaterMark.txt.
- Add one more Lookup activity that returns the current date and time.
- Add a Copy activity that picks up the JSON files whose timestamp is later than the high watermark date (Azure Data Factory filters files on their last-modified time). In the sink, push the data that was read into the Azure SQL database.
- After the Copy activity, add another Copy activity that writes the current date and time obtained by the second Lookup activity to the high watermark file.
- Add a trigger to execute this pipeline on an hourly basis.
That's how we can design the incremental data load solution for the scenario described above.
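The high watermark steps above can be simulated in a few lines of plain Python. This is only a sketch of the selection logic, not Data Factory code: the `BlobFile` class, the file names, and the timestamps are invented for the example, and the actual copy into Azure SQL is represented by a comment.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BlobFile:
    """Hypothetical stand-in for a file in the blob storage location."""
    name: str
    last_modified: datetime


def select_new_files(files, high_watermark, run_time):
    """Return files modified after the watermark and up to this run's timestamp."""
    return [f for f in files if high_watermark < f.last_modified <= run_time]


def run_incremental_load(files, high_watermark, run_time):
    """Simulate one pipeline run: pick the new files, then advance the watermark."""
    to_copy = select_new_files(files, high_watermark, run_time)
    # In the real pipeline, the Copy activity would load these files into Azure SQL here.
    new_watermark = run_time
    return to_copy, new_watermark


utc = timezone.utc
files = [
    BlobFile("vehicle_0900.json", datetime(2024, 1, 1, 9, 0, tzinfo=utc)),
    BlobFile("vehicle_1000.json", datetime(2024, 1, 1, 10, 0, tzinfo=utc)),
    BlobFile("vehicle_1100.json", datetime(2024, 1, 1, 11, 0, tzinfo=utc)),
]
watermark = datetime(2024, 1, 1, 9, 30, tzinfo=utc)  # value read from HighWaterMark.txt
run_time = datetime(2024, 1, 1, 11, 5, tzinfo=utc)   # "current date time" for this run

copied, watermark = run_incremental_load(files, watermark, run_time)
print([f.name for f in copied])  # ['vehicle_1000.json', 'vehicle_1100.json']
print(watermark.isoformat())
```

Capturing the run time before the copy and using it as the upper bound means a file that arrives while the copy is running falls after the new watermark and is picked up by the next hourly run instead of being skipped.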
Question 5: Assume that you are doing some R&D on COVID data from across the world. This data is made available by a public forum and exposed as a REST API. How would you plan the solution in this scenario?
Final Thoughts:
Azure Data Factory is a relatively new field, so there is a shortage of resources on the internet for preparing for Azure Data Factory (ADF) interviews. In this blog I have tried to provide real-world, scenario-based interview questions and answers for experienced ADF developers and professionals. I will keep adding more questions in the near future. I would also recommend going through the theoretical questions summarized in this linked article: Mostly asked Azure Data Factory Interview Questions and Answers
No, you do not need strong coding skills for Azure Data Factory. It provides more than 90 built-in connectors, and you can transform data with mapping data flows without programming skills or knowledge of Spark clusters.
Which three types of activities can you run in Microsoft Azure Data Factory?
Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.
Is coding required for Azure Data Factory?
No need to code. ADF runs the data transformations you build in mapping data flows on Spark clusters that it manages for you. With Azure Data Factory, it's fast and easy to build code-free or code-centric ETL and ELT processes, and you can create code-free pipelines within an intuitive visual environment.
How much time will it take to learn Azure Data Factory?
Some training courses claim to cover Azure Data Lake and Data Factory in just 2 days.
What are the limitations of Azure Data Factory?
| Resource | Default limit | Maximum limit |
| --- | --- | --- |
| Data sets within a data factory | 5,000 | Contact support. |
| Concurrent slices per data set | 10 | 10 |
| Bytes per object for pipeline objects | 200 KB | 200 KB |
| Bytes per object for data set and linked service objects | 100 KB | 2,000 KB |
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you may use a copy activity to copy data from a SQL Server database to Azure Blob Storage.
What is the difference between pipeline and data flow in Azure Data Factory?
Pipelines are for process orchestration. Data flows are for data transformation. In ADF, data flows are built on Spark and use data that is in Azure (Blob, ADLS, SQL, Synapse, Cosmos DB). Connectors in pipelines are for copying data and job orchestration.
Is Azure Data Factory SaaS or PaaS?
Azure Data Factory (ADF) is a Microsoft Azure PaaS solution for data transformation and load. ADF supports data movement between many on-premises and cloud data sources.
Is Azure Data Factory in demand?
According to Microsoft, almost 365,000 businesses register for the Azure platform each year. This indicates that Microsoft Azure data engineers are in high demand.
Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.
What language is Azure Data Factory written in?
Language support includes .NET, PowerShell, Python, and REST. Monitoring: you can monitor your data factories via PowerShell, the SDK, or the visual monitoring tools in the browser user interface.
Is Azure Data Factory worth learning?
Azure Data Factory is a good product when you design a good data engineering architecture. It helps with data movement, integration, and transformation.
Can you write Python code in Azure Data Factory?
Yes. You can upload the Python script to Azure Blob Storage (for example, with AzCopy) and run it from Data Factory through a Custom activity on Azure Batch. The first step is to create the Azure Batch account.
Can Azure Data Factory read Excel file?
The service supports both ".xls" and ".xlsx". Excel format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Files, File System, FTP, Google Cloud Storage, HDFS, HTTP, Oracle Cloud Storage, and SFTP.
Can I learn Azure without IT background?
The AZ-900 course has been designed from the ground up to teach Azure fundamentals to people with little or no technical background.
How do I practice Azure Data Factory?
- Only One Pipeline for All Tables/Objects. ...
- Rerun the Pipelines. ...
- Incremental Copies. ...
- Managing Data Store Credentials. ...
- Network Security. ...
- Application Logging and Auditing. ...
- Custom Logs for Your Data Pipeline.
DP-203 Azure Data Engineer Certification - Data Factory Training.
Can Azure Data Factory call a stored procedure?
Yes, you can execute a stored procedure from Azure Data Factory. Navigate to the Azure Data Factory instance in the Azure portal and click the Author & Monitor link, which opens the Data Factory portal. Since we intend to create a new data pipeline, click the Create pipeline icon in the portal.
What is max concurrent connections in Azure Data Factory?
| Resource | Default limit | Maximum limit |
| --- | --- | --- |
| Concurrent authoring operations per subscription per Azure Integration Runtime region, including test connection, browse folder list and table list, and preview data | 200 | Contact support. |
Azure Data Factory activities can be broadly categorized into three groups: data movement, data transformation, and control.
Can you have more than 1 origin in a StreamSets pipeline?
StreamSets Data Collector pipelines can only have a single origin, so this isn't possible with or without the SDK.
Can you clone a pipeline?
If the pipeline to copy is in the same project, you can clone it, and if it is in a different project you can export it from that project and import it into your project. For information on migrating a classic build pipeline to YAML using Export to YAML, see Migrate from classic pipelines.
How do you automate an Azure Data Factory pipeline?
- Open an Azure DevOps project, and go to Pipelines. Select New Pipeline.
- Select the repository where you want to save your pipeline YAML script. We recommend saving it in a build folder in the same repository of your Data Factory resources. ...
- Select Starter pipeline.
What are triggers in ADF?
Azure Data Factory triggers are used to schedule pipeline runs without any manual intervention. In other words, a trigger is a processing unit that determines when to begin or invoke an end-to-end pipeline execution in Azure Data Factory.
How do I manually run a pipeline in Azure Data Factory?
Select Trigger on the toolbar, and then select Trigger Now. On the Pipeline Run page, select OK. Go to the Monitor tab on the left. You see a pipeline run that is triggered by a manual trigger.
What is similar to Azure Data Factory?
Azure Databricks, Talend, AWS Data Pipeline, AWS Glue, and Apache NiFi are the most popular alternatives and competitors to Azure Data Factory.
What kind of tool is Azure Data Factory?
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
Is Azure Data Factory the same as SSIS?
Azure Data Factory supports both batch and streaming data processes while SSIS supports only batch processes. Azure Data Factory allows you to define a series of tasks that need to be performed on data, such as copying data from one location to another, analyzing it, and storing it in a database.
It also helps in automating data processing and design. An expert in this skill will therefore have plenty of job roles to choose from, and multiple companies hire for these skills. We can say that this popular ETL software will be in good demand in the future IT market.
Can I get a job in Azure without experience?
Candidate must possess relevant professional experience in system engineering or computer related. Understanding and experience working with Microsoft…
Is Azure better than Python?
Python has broader approval, being mentioned in 2830 company stacks and 3641 developer stacks, compared to Azure Machine Learning, which is listed in 12 company stacks and 8 developer stacks.
Which ETL tool is easiest?
It varies from user to user, but some of the easiest ETL tools to learn are Hevo, Dataddo, Talend, and Apache NiFi, because of their simple-to-understand UI and because they don't require much technical knowledge.
Does Azure Data Factory use Spark?
With a Spark activity and an on-demand HDInsight linked service, Azure Data Factory automatically creates an HDInsight cluster and runs the Spark program.
Is Azure Data Factory a tool?
Azure Data Factory is a user interface tool that offers a graphical overview for creating and managing activities and pipelines. It doesn't require coding skills, yet complex transformations will require Azure Data Factory experience.
What is the difference between Data Factory and Databricks?
The last and most significant difference between the two tools is that ADF is generally used for data movement, ETL processes, and data orchestration, whereas Databricks helps with data streaming and real-time data collaboration.
What is SSIS in Azure Data Factory?
An Azure-SSIS IR supports:
- Running packages deployed into the SSIS catalog (SSISDB) hosted by an Azure SQL Database server or Managed Instance (Project Deployment Model)
- Running packages deployed into the file system, Azure Files, or a SQL Server database (MSDB) hosted by Azure SQL Managed Instance (Package Deployment Model)
How popular is Azure Data Factory?
Azure Data Factory is the #1 ranked solution in top Data Integration Tools and the #2 ranked solution in top Cloud Data Warehouse tools.
What is prerequisite to learn Azure Data Factory?
The following prerequisites should be completed: successfully log in to the Azure portal, understand the Azure storage options, and understand the Azure compute options.
Error handling. Azure Data Factory and Synapse pipeline orchestration allows conditional logic and enables users to take different paths based on the outcome of a previous activity. Using different paths allows users to build robust pipelines and incorporate error handling in ETL/ELT logic.
How do I run a batch file from Azure Data Factory?
Create a Pool with compute nodes:
- In the Batch account, select Pools > Add.
- Enter a Pool ID called mypool.
- In Operating System, select the following settings (you can explore other options).
If you are an advanced user looking for a programmatic interface, Data Factory provides a rich set of SDKs that you can use to author, manage, or monitor pipelines by using your favorite IDE. Language support includes .NET, PowerShell, Python, and REST.
How do I run a SQL script from Azure Data Factory?
- Truncate a table or view in preparation for inserting data.
- Create, alter, and drop database objects such as tables and views.
- Re-create fact and dimension tables before loading data into them.
- Run stored procedures.
1. Azure Solutions Architect Expert. Earning the Azure Solutions Architect Expert certification, and passing its two demanding certification exams, is one of the most challenging feats in cloud certifications.
Is Microsoft Azure difficult to learn?
Azure is a cloud platform provided by Microsoft. It provides a wide range of services such as compute, storage, networking, and applications. Azure is easy to learn. It has a well-defined interface, and the tutorials are well written.
Is Azure Data Factory any good?
Azure Data Factory is a good product when you design a good data engineering architecture. It helps with data movement, integration, and transformation. We can automate the process of collecting data from sources (e.g. a file system) and moving it to a destination (e.g. Azure Synapse).
Is AWS harder than Azure?
Both Azure and AWS can be hard if you try to learn the platform without guidance, and both can be very easy if you are sufficiently guided. However, many IT professionals claim AWS is much easier to learn and to get certified in.
What are the top 3 certifications in Azure?
- Microsoft Azure Fundamentals – AZ-900 Exam.
- Microsoft Azure Administrator – AZ-103.
- Microsoft Azure Developer – AZ-203.
- Microsoft Azure Security Engineer – AZ-500.
- Microsoft Azure AI Engineer – AI-100.
- Microsoft Azure Data Scientist – DP-100.
Microsoft Certified: Azure Solutions Architect Expert
This is a challenging certification and one of the highest paying in the cloud. It is estimated that the average salary for a Microsoft Certified Azure Solutions Architect Expert is around USD 135,000.
Azure Cloud Engineer salaries in India range from ₹ 3.9 Lakhs to ₹ 15.3 Lakhs, with an average annual salary of ₹ 6.1 Lakhs.
Can a non coder learn Azure?
You do not need coding skills to use Microsoft Azure. This holds whether you are just getting interested in learning Microsoft Azure or are already a seasoned Azure administrator with no previous coding experience.
Yes! There is no prerequisite for learning Azure, and the AZ-900 will help you understand Azure basics and explain what each Azure offering does.
Which language is best for Azure?
Many people recommend Node.js (JavaScript) for Azure because it is easy to learn and powerful. However, C# is also a good choice if you want a career in enterprise development.
What language is used in Azure Data Factory?
Azure Data Factory Studio displays English while the setting is Japanese. - Microsoft Q&A.
What is AWS equivalent to Azure Data Factory?
AWS Glue and Azure Data Factory serve similar purposes. Both provide managed extract, transform and load services. Organizations can use these services to build integrated data pipelines in the cloud.
Which cloud job has highest salary?
- According to a report by Statista, Senior Solutions Architects earn an average of $141,000, making it the highest paying job.
- The cloud architect position is also pretty lucrative with an average salary of $135,977.
Microsoft Azure Fundamentals certified professionals can earn an average salary of Rs. 9,00,000 per year in India. In the AWS vs Azure salary comparison, the salary of an AWS-certified IT professional is reported to exceed that of an Azure-certified IT professional.
Who earns more AWS or Azure?
According to the data, AWS professionals earn ₹6.3 Lakhs per year on average, whereas Azure professionals earn around ₹6.1 Lakhs per year. Along with higher salaries, AWS is said to be easier to learn because its documentation is easier to follow.