
Saturday, April 8, 2023

Cassandra

Cassandra is a NoSQL database designed for write-heavy workloads and eventual consistency.
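As a rough illustration, here is a minimal sketch of writing to and reading from Cassandra with the DataStax Python driver (cassandra-driver). The keyspace, table, and contact point are hypothetical; ConsistencyLevel.ONE gives a fast, eventually consistent write, while QUORUM trades some latency for a stronger read.

# Minimal sketch using the DataStax Python driver (pip install cassandra-driver).
# The keyspace "demo_ks" and table "events" are hypothetical.
from datetime import datetime
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # contact point of a local cluster
session = cluster.connect("demo_ks")      # hypothetical keyspace

# Writes in Cassandra are cheap (append to commit log + memtable),
# which is why the data model favors high write throughput.
insert = SimpleStatement(
    "INSERT INTO events (device_id, ts, reading) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,   # fast, eventually consistent write
)
session.execute(insert, ("sensor-42", datetime.utcnow(), 21.5))

# A read at QUORUM trades some latency for stronger consistency.
select = SimpleStatement(
    "SELECT reading FROM events WHERE device_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(select, ("sensor-42",)):
    print(row.reading)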












---------------------------------------------------------------------------- 

Tuesday, January 17, 2023

Kappa Architecture - SPA


Kappa Architecture

• The idea is to handle both real-time data processing and continuous reprocessing in a single stream processing engine.
• This requires that the incoming data stream can be replayed (very quickly), either in its entirety or from a specific position (see the replay sketch below).
• If there are any code changes, a second stream process replays all previous data through the latest real-time engine and replaces the data stored in the serving layer.
• This architecture attempts to simplify things by keeping only one code base, rather than managing one each for the batch and speed layers of the Lambda Architecture.
• In addition, queries only need to look in a single serving location instead of going against both batch and speed views.
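To make the replay idea concrete, here is a minimal sketch using the kafka-python client: reading a topic from the earliest offset replays the whole retained stream through the latest version of the processing code and rebuilds the serving view. The topic name and the counting logic are purely illustrative.

# Minimal sketch of Kappa-style reprocessing with kafka-python
# (pip install kafka-python). Topic name and serving store are hypothetical.
import json
from kafka import KafkaConsumer

# Reading from the earliest offset replays the entire retained stream
# through the latest version of the processing code.
consumer = KafkaConsumer(
    "clickstream",                        # hypothetical topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",         # replay from the beginning
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,             # stop once the backlog is drained
)

serving_view = {}                         # stand-in for the serving layer
for record in consumer:
    event = record.value
    # Latest processing logic: count clicks per page.
    serving_view[event["page"]] = serving_view.get(event["page"], 0) + 1

# Once the replay catches up, this view replaces the one in the serving layer.
print(serving_view)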






Pros
• Kappa architecture can be used to develop data systems that are online learners and therefore don’t need the batch layer.
• Re-processing is required only when the code changes.
• It can be deployed with fixed memory.
• It can be used for horizontally scalable systems.
• Fewer resources are required because the machine learning is done on a real-time basis.
Cons
• The absence of a batch layer might result in errors during data processing or while updating the database; this requires an exception manager to reprocess the data or perform reconciliation.




---------------------------------------------------------------------------- 

Lambda Architecture - SPA

Basic Flow of Events of the Lambda Architecture

1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions:
(i) managing the master dataset (an immutable, append-only set of raw data)
(ii) pre-computing the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views (see the sketch below).
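A toy sketch of step 5, merging a batch view with a real-time view at query time (all names and numbers are illustrative):

# Toy sketch of step 5: answer a query by merging the pre-computed
# batch view with the incremental real-time view. Names are illustrative.

def query_page_views(page, batch_view, realtime_view):
    """Merge counts from the batch view (complete but stale) with the
    real-time view (recent data only) to answer a query."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

batch_view = {"/home": 10_000, "/docs": 2_500}    # computed by the batch layer
realtime_view = {"/home": 37, "/pricing": 4}      # recent events from the speed layer

print(query_page_views("/home", batch_view, realtime_view))      # 10037
print(query_page_views("/pricing", batch_view, realtime_view))   # 4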

Batch Layer

• New data arrives continuously, as a feed to the data system.
• It is fed to the batch layer and the speed layer simultaneously.
• The batch layer looks at all the data at once and eventually corrects the data in the stream layer.
• Here we can find lots of ETL and a traditional data warehouse.
• This layer runs on a predefined schedule, usually once or twice a day.
• The batch layer has two very important functions (a small sketch follows below):
• To manage the master dataset
• To pre-compute the batch views.
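As a toy sketch, the batch layer's job of pre-computing a view from the immutable master dataset might look like this (plain Python standing in for a real batch framework; all data is illustrative):

# Toy sketch of the batch layer: recompute a batch view from scratch over
# the immutable, append-only master dataset on a fixed schedule.
from collections import Counter

master_dataset = [                        # append-only raw events (illustrative)
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]

def compute_batch_view(events):
    """Scan all raw data and pre-compute the page-view counts."""
    return Counter(e["page"] for e in events)

batch_view = compute_batch_view(master_dataset)   # run e.g. once or twice a day
print(batch_view)                                 # Counter({'/home': 2, '/docs': 1})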




Speed Layer (Stream Layer)

• This layer handles data that is not yet reflected in the batch view because of the latency of the batch layer.
• In addition, it deals only with recent data, creating real-time views so that the user still gets a complete view of the data.
• The speed layer produces its outputs through an enrichment process and supports the serving layer by reducing the latency of query responses.
• As its name suggests, the speed layer has low latency because it deals only with real-time data and has a smaller computational load (a small incremental-update sketch follows below).
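A matching toy sketch of the speed layer: it updates its real-time view incrementally for each event, covering only data not yet absorbed by the batch layer (illustrative only):

# Toy sketch of the speed layer: update a real-time view incrementally for
# each new event, covering only data not yet reflected in the batch view.
realtime_view = {}

def on_event(event):
    """Apply a single recent event to the low-latency real-time view."""
    page = event["page"]
    realtime_view[page] = realtime_view.get(page, 0) + 1

for event in ({"page": "/home"}, {"page": "/pricing"}):
    on_event(event)

print(realtime_view)   # {'/home': 1, '/pricing': 1}
# After the next batch run absorbs these events, this view can be discarded.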

Serving Layer

• The outputs from the batch layer (batch views) and from the speed layer (near-real-time views) are forwarded to the serving layer.
• This layer indexes the batch views so that they can be queried with low latency on an ad-hoc basis.

Pros
• The batch layer of the Lambda architecture manages historical data in fault-tolerant distributed storage, which keeps the possibility of errors low even if the system crashes.
• It is a good balance of speed and reliability.
• It is a fault-tolerant and scalable architecture for data processing.
Cons
• It can result in coding overhead because comprehensive processing is involved.
• Data is re-processed every batch cycle, which is not beneficial in certain scenarios.
• Data modelled with the Lambda architecture is difficult to migrate or reorganize.




---------------------------------------------------------------------------- 

Generalized Architecture - SPA

Streaming Data System Architecture Components
• Collection
• Data Flow
• Processing
• Storage
• Delivery




Collection System 

• Mostly communication over a TCP/IP network using HTTP
• Website log data was the initial use case
• The W3C standard log data format was used
• Newer formats like JSON, Avro, and Thrift are available now
• Collection happens at specialized servers called edge servers
• The collection process is usually application specific (a minimal endpoint sketch follows below)
• Newer servers integrate directly with data-flow systems
• Older servers may or may not integrate directly with data-flow systems
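A minimal sketch of an edge collection server using only the Python standard library: it accepts JSON events over HTTP and hands them to the data-flow tier. The forwarding function is a placeholder for a real producer client.

# Minimal sketch of an edge "collection server": accept JSON events over HTTP
# and hand them to the data-flow tier. Standard library only; the forwarding
# step is a stand-in for a real producer (e.g. a Kafka client).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def forward_to_data_flow(event):
    # Placeholder: in practice this would publish to the data-flow tier.
    print("forwarding:", event)

class CollectHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))   # e.g. a JSON-formatted log line
        forward_to_data_flow(event)
        self.send_response(204)                       # accepted, no response body
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CollectHandler).serve_forever()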

Data Flow Tier

• Separation between the collection tier and the processing layer is required
• The rates at which these systems work are different
• What if one system cannot cope with the other?
• An intermediate layer is required that takes responsibility for
• accepting messages / events from the collection layer
• providing those messages / events to the processing layer
• Real-time interface to the data layer for both producers and consumers of data
• Helps guarantee “at least once” semantics (a toy buffering sketch follows below)
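A toy sketch of the decoupling idea: a bounded in-memory buffer absorbs rate differences between the tiers, and events are re-queued if processing fails, which is roughly what at-least-once delivery looks like. Illustrative only; real systems use a durable log such as Kafka rather than an in-process queue.

# Toy sketch of the data-flow tier: a bounded buffer decouples the collection
# tier from the processing tier, and failed events are re-delivered, giving
# "at least once" (possibly duplicated) processing.
import queue

buffer = queue.Queue(maxsize=10_000)     # absorbs rate differences between tiers

def collect(event):
    buffer.put(event)                    # blocks if the processor falls behind

def handle(event):
    print("processed:", event)           # hypothetical processing step

def process_one():
    event = buffer.get()
    try:
        handle(event)
        buffer.task_done()               # "ack": the event has been handled
    except Exception:
        buffer.put(event)                # re-deliver on failure (may duplicate)

collect({"page": "/home"})
process_one()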

Processing / Analytics Tier

• Based on the “data locality” principle
• Move the software / code to the location of the data
• Rely on distributed processing of data
• The framework does most of the heavy lifting: data partitioning, job scheduling, and job management
• Available frameworks (a streaming word-count sketch follows below)
• Apache Storm
• Apache Spark (Streaming)
• Apache Kafka Streams, etc.
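For a concrete feel of one such framework, here is the standard word-count sketch with PySpark Structured Streaming, assuming pyspark is installed and text lines arrive on a local socket (for example via nc -lk 9999):

# Word-count sketch with PySpark Structured Streaming (assumes pyspark is
# installed and text lines arrive on a local socket).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")     # re-emit the full updated table each trigger
         .format("console")
         .start())
query.awaitTermination()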

Storage Tier

• In memory or permanent
• Usually in memory, as data is processed once
• But there are use cases where events / outcomes need to be persisted as well
• NoSQL databases are becoming a popular choice for permanent storage
• MongoDB
• Cassandra
• But usage varies by use case; there is still no database that fits all use cases

Delivery Layer  

• Usually a web-based interface
    • Nowadays mobile interfaces are becoming quite popular
    • Dashboards are built with streaming visualizations that get continuously updated as underlying events are processed (a push sketch follows below)
    • HTML + CSS + JavaScript + WebSockets can be used to create interfaces and update them
    • HTML5 elements can be used to render interfaces
    • SVG and PDF formats are used to render the outcomes
• Monitoring / alerting use cases
• Feeding data to downstream applications
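A small sketch of the push side of such a dashboard, assuming the third-party Python websockets package (version 10 or later); the metric values are simulated, and a real dashboard would render them with JavaScript in the browser.

# Sketch of pushing live updates to a browser dashboard over WebSockets,
# assuming the third-party `websockets` package (version 10+).
import asyncio
import json
import random
import websockets

async def push_metrics(ws):
    """Send a fresh (simulated) metric to the connected dashboard every second."""
    while True:
        update = {"metric": "events_per_sec", "value": random.randint(100, 200)}
        await ws.send(json.dumps(update))
        await asyncio.sleep(1)

async def main():
    async with websockets.serve(push_metrics, "localhost", 8765):
        await asyncio.Future()   # run forever

asyncio.run(main())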






---------------------------------------------------------------------------- 

Thursday, January 5, 2023

Hadoop 2 - Architecture







---------------------------------------------------------------------------- 

MapReduce Programming Architecture and Flow

1. The input dataset is split into multiple pieces of data (several small sets).
2. The framework creates a master and several worker processes and executes the worker processes remotely.
3. Several map tasks work simultaneously and read the pieces of data assigned to them. Each map worker uses the map function to extract only the data present on its server and generates key/value pairs for the extracted data.
4. The map worker uses a partitioner function to divide the data into regions. The partitioner decides which reducer should get the output of a given mapper.
5. When the map workers complete their work, the master instructs the reduce workers to begin theirs.
6. The reduce workers in turn contact the map workers to get the key/value data for their partition (shuffle). The data received from the various mappers is merge-sorted by key.
7. Each reduce worker then calls the reduce function on every unique key. This function writes its output to a file.
8. When all the reduce workers complete their work, the master transfers control back to the user program. (A toy word-count sketch follows below.)
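A toy, single-process word-count sketch of this map -> partition -> shuffle/sort -> reduce flow (all function and variable names are illustrative):

# Toy, single-process sketch of the MapReduce flow described above, using
# word count. In a real cluster the map and reduce tasks run on many workers.
from collections import defaultdict

def map_func(chunk):
    """Map task: emit a (word, 1) pair for every word in its chunk."""
    return [(word, 1) for word in chunk.split()]

def partition(key, num_reducers):
    """Partitioner: decide which reducer receives this key."""
    return hash(key) % num_reducers

def reduce_func(key, values):
    """Reduce task: combine all values seen for one key."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog", "the end"]   # input splits
num_reducers = 2

# Shuffle: group mapper output by key inside the partition that owns it.
partitions = [defaultdict(list) for _ in range(num_reducers)]
for chunk in chunks:                      # map tasks run in parallel in practice
    for key, value in map_func(chunk):
        partitions[partition(key, num_reducers)][key].append(value)

# Each reduce worker processes the keys in its partition in sorted order.
for p in partitions:
    for key in sorted(p):
        print(reduce_func(key, p[key]))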






---------------------------------------------------------------------------- 

Wednesday, January 4, 2023

Big Data Architecture Challenges








---------------------------------------------------------------------------- 

Big Data architecture style





---------------------------------------------------------------------------- 