It would not be wrong to say that today's economy is a data economy. In every walk of life
there are activities and transactions generating information and stored data. This stored data can
provide tactical reports, i.e. reports about what has already happened, which are of limited use in
decision making. This is why Data Analytics, the branch of data science that has recently gained
popularity, matters: it not only gives you insight into what happened but, on the basis of certain logic,
predicts what may happen and what to expect. This is critical for businesses in today's VUCA
(volatile, uncertain, complex, ambiguous) environment, not only to survive but to grow. Data Analytics
is therefore a subject in high demand from business, and this article will walk you through all aspects of Data Analytics.
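As a toy contrast between the two kinds of output mentioned above: a descriptive (tactical) report simply totals what has already happened, while a predictive step fits some logic to the history and extrapolates. The sales figures and the simple linear trend below are invented purely to illustrate the idea.

```python
import numpy as np

# Invented monthly sales history (the "what already happened" part).
sales = [100, 110, 125, 130, 145, 150]

# Descriptive reporting: summarise the past.
total_so_far = sum(sales)

# Predictive analytics (simplest possible form): fit a linear trend
# to the history and extrapolate one month ahead.
months = np.arange(len(sales))
slope, intercept = np.polyfit(months, sales, deg=1)
next_month_forecast = slope * len(sales) + intercept
```

Real predictive models are of course far richer than a straight line, but the shape of the idea is the same: past data in, an expectation about the future out.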
Data Engineering and Analytics is a complex subject: it has several components involving a diverse
set of technologies and processes. The main Data Engineering activities are described below.
There are three steps, or components, in end-to-end Data Engineering. It is a journey from
data sources through data processing and transformation, Data Warehouse / Data Lake set-up and data upload,
finally getting the Data Warehouse / Lake ready for the various analytics actions:
- ELT/ETL Software Engineering: This is the first step in Data Engineering, and a very
important one; the success or failure of Data Analytics depends upon this core step. In
this software engineering process, data is extracted from various sources, then
transformed and loaded (ETL), or sometimes loaded first and transformed afterwards (ELT). Imagine your
organization has multiple data sources, such as Oracle, DB2, SQL databases and text files, which are
supporting multiple applications and products. The Data Architect defines what data,
and how much of it, will be extracted from these sources, transformed and uploaded into the
data lake or warehouse. The transformation steps can involve data cleansing,
formatting, removing duplicates and so on. Pentaho, Google Dataflow,
Azure Data Factory and AWS Glue are commonly used ETL tools, whereas
Talend, Airflow and Hevo Data are commonly used ELT tools for this process.
- Data Engineering for Data Lakes and Warehouses: The Data Architect has to design and
decide upon the best technology choices for the Data Warehouse or Data Lakes. As input to
the warehouse or lake design, it is important to establish the data relationships, understand
the data attributes, and understand how the data will be consumed. Once the
warehouses or lakes are populated with data through the ELT/ETL process, they are ready
for the analytics layer to consume. Snowflake, Google BigQuery, Microsoft Azure Synapse,
and the IBM and Oracle data warehouse offerings are commonly used for Data Warehousing.
Azure Data Lake Storage, AWS Lake Formation, Qubole and Infor Data Lake are commonly used tools to build a Data Lake.
- Data Visualisation/Analytics: Once the data has been made available in the warehouses
or data lakes, this third step of Data Engineering opens up several possibilities for consuming it:
- Applications – Special-purpose applications can consume it when they need related data from various sources together.
- Data Analytics – Tools like Power BI, Qlik and Tableau can now be implemented on
top of the collected data in the warehouses to perform the necessary analytics activities.
- Data Scientists – The collected data can be used and analysed by data scientists
to infer patterns, understand what the data is saying, recommend decisions, and so on.
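The extract-transform-load flow described in the first step above can be sketched in a few lines of Python. This is a minimal illustration using pandas, not any of the specific tools named above; the column name and cleansing rules are assumptions chosen for the example.

```python
import pandas as pd

def extract(frames):
    """Combine raw records pulled from several sources into one table.
    Here each source is already a DataFrame; a real pipeline would
    read from Oracle, DB2, flat files, APIs, etc."""
    return pd.concat(frames, ignore_index=True)

def transform(df):
    """The cleansing steps mentioned in the text: drop unusable rows,
    normalise formatting, remove duplicates."""
    df = df.copy()
    df = df.dropna(subset=["customer_id"])                          # cleansing
    df["customer_id"] = df["customer_id"].astype(str).str.strip()   # formatting
    df = df.drop_duplicates(subset="customer_id")                   # de-duplication
    return df

def load(df, path):
    """Write the cleansed data to the warehouse/lake landing zone."""
    df.to_csv(path, index=False)
```

In practice the extract step would use database connectors and the load step would target the warehouse's bulk-load interface, but the shape of the pipeline, and the Data Architect's decisions about what to extract and how to cleanse it, stay the same.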
Thus, data visualization is the representation of data through common graphics,
such as charts, plots, infographics and even animations. These visual displays of
information communicate complex data relationships and data-driven insights in a way
that is easy to understand.
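As a toy illustration of this visualization step, the short script below draws a bar chart from pre-aggregated data. In practice a BI tool such as Power BI or Tableau would produce the chart from a warehouse query; here matplotlib is used directly, and the revenue figures are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")          # headless backend, no display needed
import matplotlib.pyplot as plt

# Invented, pre-aggregated figures standing in for a warehouse query result.
quarterly_revenue = {"Q1": 120, "Q2": 150, "Q3": 90, "Q4": 180}

fig, ax = plt.subplots()
ax.bar(list(quarterly_revenue.keys()), list(quarterly_revenue.values()))
ax.set_title("Quarterly revenue (illustrative data)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue")
fig.savefig("quarterly_revenue.png")   # or show interactively in a notebook
```

The point is not the charting library but the division of labour: the warehouse delivers clean, aggregated data, and the visualization layer only has to present it.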
By 2025, organisations worldwide will generate 463 exabytes of data. If you want to utilise it and
build a data-driven culture in your organisation, you will need to understand the challenges in data
analytics and the methods to overcome them. Based on my experience, here is
what may work best to address these challenges:
- Collecting meaningful data: The huge volume of data being generated can overwhelm
employees. They may then analyse the data that is readily available rather than the data
that is really critical, which will certainly not help.
Solution: Possibly deploy a Data Analyst, and improve data literacy so that
employees know what to work on and which data is critical to the business.
- Selecting the right tools: With the wide range of tools available, deciding which one to select for
ETL, Data Warehousing and Analytics is difficult, and a choice whose pros and cons
have not been debated may not give the desired results.
Solution: Use expert consulting advice, or form a core group of business and
IT leadership to evaluate the right tools. The design has to be viewed not in
silos but thought through end to end. The Data Architect has to consider what the
tool chain will be, how the handshake across technologies and tools will happen, and what the
right combination of tools is across the ETL, data warehouse and analytics chain.
- Consolidating data from multiple sources: Data comes from scattered and disjointed
sources. For instance, you will need to pull data from your website, social media pages, CRM portals, financial reports, e-mails, competitors' websites and so on. The data formats of most of
these sources will obviously vary, and bringing them together in one common place for analysis is a
challenge.
Solution: A central data hub or Data Warehouse can be created to put the data in one
location, with relationships established as needed. This decision has to be made by the Data
Architect upfront, in the ETL/ELT phase of software engineering.
- Data Quality: This is the most important issue, and it affects all downstream activities. When
data is updated in one application but not everywhere else, data consistency errors arise.
Manual data entry is another source of quality errors, and without validation
logic in data uploads and data creation, wrong or corrupt data can get into the storage.
Solution: As far as possible, eliminate manual data entry points. Data validation
should be in place at various stages to ensure the data is in line with the design. Wherever
possible, automate data uploads, and design data synchronization with checks and balances.
- Building a data culture among employees: According to a study, the biggest obstacle to
becoming a data-driven company lies in an organisation's culture, not in its technologies:
a meagre 9.1% of executives pointed to technology as a challenge on the path to data
analysis. Often, even though top-level management understands the importance of data analysis, it does
not extend the desired support to its employees. Constant pressure and lack of support
from the top are among the most significant data analytics challenges facing lower-level employees.
Solution: Up-skilling employees on data and tooling, training them on the
importance of data, and recognising the innovative solutions they come up with are all
actions that will help improve the data culture of the organisation.
- Data Security: Different types of data are being collated in one place, including
business-sensitive data and employee data. Unprotected data sources can become an easy
entry point for hackers, and uncontrolled access to sensitive information can create huge issues and
affect the business.
Solution: A data privacy and protection policy needs to be defined and implemented across
the organisation. Data must be encrypted while being transmitted across networks, and
data access must be authenticated with company-defined security measures. There have to
be frequent audits of the data security measures and their current state, and any violation must be
dealt with strictly. Beyond this, there has to be physical access control, strict adherence
to system access-control measures, no access to developer machines via external attachable devices
such as USB drives and disks, and no unauthorised cloud data uploads.
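The data quality advice above, validation at every stage so bad records never reach the storage, can be made concrete with a small validation gate run before any upload. The field names and rules below are illustrative assumptions; a real pipeline would derive them from the warehouse schema.

```python
def validate_record(record, required_fields, field_types):
    """Return a list of human-readable validation errors for one
    incoming record; an empty list means the record may be loaded."""
    errors = []
    for field in required_fields:
        if record.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")
    for field, expected_type in field_types.items():
        value = record.get(field)
        if value is not None and not isinstance(value, expected_type):
            errors.append(f"field {field!r} should be {expected_type.__name__}")
    return errors

def gate(records, required_fields, field_types):
    """Split a batch into loadable records and rejects, so that
    invalid data never reaches the warehouse."""
    good, rejected = [], []
    for record in records:
        errors = validate_record(record, required_fields, field_types)
        if errors:
            rejected.append((record, errors))   # route to a quarantine/review queue
        else:
            good.append(record)
    return good, rejected
```

Rejected records should be quarantined with their error lists rather than silently dropped, so the upstream data entry or synchronization problem can be traced and fixed.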