Training

To empower your developers and data analysts, we provide a set of courses ranging from beginner to advanced topics.

Together we define a training curriculum to enable your developers and data analysts. On demand, we also provide courses on specific topics not covered by the standard curriculum.

Coaching smooths the adoption of Big Data technologies by your developers and data analysts. DataCrunchers can assist your employees either on site or remotely, whether on your first project or on more advanced tasks later on.

The following courses are available:

Apache Spark

Big Data is the hype of the moment in ICT and marketing. Since its inception in 2006, Apache Hadoop has been regarded as the de facto standard for storing and processing large data volumes in batch.

But every technology has its limitations, and Hadoop is no exception: it is batch-oriented, and the MapReduce framework is too limited to handle all types of data analysis within a single technology stack.

As the volume and speed of data generation steadily increase, so does the need for faster data processing and analysis to meet the needs and expectations of end users.

IBM calls Apache Spark “the most important new open source project in a decade”. Apache Spark addresses both speed and versatility by offering an “open source data analytics cluster computing framework”. Spark was developed in 2009 at the AMPLab (Algorithms, Machines, and People Lab) of the University of California, Berkeley, and donated to the open source community in 2010. It is faster than Hadoop, in some cases up to 100 times faster, and it offers a framework that supports different types of data analysis within the same technology stack: fast interactive queries, streaming analysis, graph analysis and machine learning. During this three-day hands-on workshop, we discuss the theory and practice of several data analysis applications.
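
To give a flavour of the Spark Core API covered on Day 1, here is a minimal word-count sketch in Scala. The application name, input path and local master setting are placeholders for illustration, not part of the course material.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Configure and create the SparkContext, the entry point to Spark Core
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Load a text file into an RDD (the path is a placeholder)
        val lines = sc.textFile("hdfs:///data/input.txt")

        // Transformations are lazy: nothing runs until an action is called
        val counts = lines
          .flatMap(_.split("\\s+"))    // split each line into words
          .map(word => (word, 1))      // pair every word with a count of 1
          .reduceByKey(_ + _)          // sum the counts per word
          .cache()                     // keep the result in memory for reuse

        // Actions such as take() trigger the actual computation
        counts.take(10).foreach(println)

        sc.stop()
      }
    }

The same lazy transformation/action model, along with caching and shared variables, is explored in depth during the hands-on sessions.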

Agenda

Day 1: Spark Core
  • What is Apache Spark?
  • Just enough Scala
  • Spark Core API
    • Spark Shell
    • SparkContext
    • Spark Master
    • RDD
    • Transformations & Actions
    • Caching
    • Shared Variables: Accumulators & Broadcast variables
    • Spark Applications
    • Spark Execution Model
  • Spark Notebooks
Day 2: Spark Modules
  • Spark Core (Continued)
  • Spark SQL
  • Spark Streaming
Day 3: Advanced Spark
  • Big Data Architectures
  • Putting it all together (Batch & Streaming)
  • Performance Tuning
  • Operational Tips
  • MLlib & GraphX

The Hadoop Ecosystem

The rise of the internet, social media and mobile technologies, and in the very near future the Internet of Things, ensures that our data footprint is growing fast. Companies like Google and Facebook were quickly confronted with massive data sets, which led to a new way of thinking about data. Hadoop provides an open source solution based on the same technology used within Google. It allows you to store and analyze huge amounts of data in a scalable way and to create new insights.
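
As a small illustration of what storing data in a scalable way looks like in practice, here is a sketch that writes to and lists an HDFS directory using Hadoop's FileSystem API from Scala. The paths are placeholders, and the cluster configuration is assumed to come from a core-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsDemo {
      def main(args: Array[String]): Unit = {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        val fs = FileSystem.get(new Configuration())

        // Write a small file to HDFS (the path is a placeholder)
        val out = fs.create(new Path("/user/demo/hello.txt"))
        out.write("Hello, Hadoop!\n".getBytes("UTF-8"))
        out.close()

        // List the directory to verify the write
        fs.listStatus(new Path("/user/demo")).foreach(status => println(status.getPath))

        fs.close()
      }
    }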

With this workshop we want to give everyone the opportunity to get acquainted with the Hadoop Ecosystem.

This course can be booked with exercises (2 days) or without (1 day).

Agenda

  • Introduction
  • What is Big Data?
    • Volume, Variety, Velocity
    • Business Drivers
    • Technical Drivers
    • Big Data Evolution
  • The Hadoop Ecosystem
    • What is Hadoop?
    • Hadoop Services
    • Hadoop Distributions
  • Storage
    • HDFS
    • HBase
    • Kudu
    • Data Modelling
  • Processing
    • MapReduce
    • Hive
    • Pig
    • Spark
    • YARN
  • Integration
    • Sqoop
    • Flume
    • Kafka
  • Indexing
    • Elasticsearch
  • Big Data Architectures
    • Architectures
      • Lambda
      • Kappa
      • Zeta
    • Trends