Training

To empower your developers and data analysts, we provide a set of courses ranging from beginner to intermediate to advanced topics.

The following courses, described below, are available: Apache Spark and The Hadoop Ecosystem.

Together we define a training curriculum to enable your developers and data analysts. On request, we also provide courses on specific topics not covered by our standard courses.

Coaching helps smooth the adoption of Big Data technologies by your developers and data analysts. DataCrunchers can assist your employees either on site or remotely. Get assistance on your first project or on more advanced tasks later on.

Apache Spark

Big Data is the hype of the moment in ICT and marketing. Since its inception in 2006, Apache Hadoop has been regarded as the de facto standard for storing and processing big data volumes in batch.

But every technology has its limitations, and this is no different for Hadoop: it is batch-oriented and the MapReduce framework is too limited for handling all types of data analysis within the same technology stack.

Apache Spark makes big data easier to implement. It was developed in 2009 at the AMPLab (Algorithms, Machines, and People Lab) of the University of California, Berkeley, and released to the open source community in 2010. It is faster than Hadoop MapReduce, in some cases up to 100 times faster, and it offers a framework that supports different types of data analysis within the same technology stack: fast interactive queries, streaming analysis, graph analysis and machine learning. During this two-day hands-on workshop, we discuss the theory and practice of several data analysis applications.

Notebook technologies (Zeppelin, Jupyter, Spark Notebook, Databricks Cloud, …) allow you to go from prototype to production workflow in one go. Notebooks let you implement “repeatable research” by mixing executable code with comments, images, tables, links, and more.

We’ve chosen Databricks Cloud as the notebook technology because it is currently the most mature, enterprise-ready notebook technology on the market. It is implemented on top of AWS, and Azure support is reportedly on the roadmap as well.

This course supports Spark 2.x.

Agenda

Day 1: Spark Basics & RDDs
  • What is Apache Spark?
  • Notebooks are coming
  • Just enough Scala
  • Spark Basics
    • Spark 2.x
    • A tale of 3 APIs
    • Spark Shell
    • SparkContext
    • Spark Master
    • RDD
    • Transformations & Actions
    • Caching
    • Shared Variables: Accumulators & Broadcast variables
    • Spark Applications
    • Spark Execution Model
Day 2: Spark SQL, DataFrames & Datasets
  • Spark SQL, DataFrames & Datasets
    • Introduction: RDDs vs DataFrames vs Datasets
    • Basic DataFrame Operations
    • Different Types of Data
    • Aggregations
    • Joins
    • Data Sources
    • SQL
    • Datasets

Slide examples are made available in Databricks Cloud notebooks, so students can execute the samples while following the course.

All exercises are performed using Databricks Cloud. The solution notebooks are provided at the end of the course.

Contact me regarding prices for groups, schedules, etc. at geert@datacrunchers.eu.

I give this course throughout Europe.

The Hadoop Ecosystem

The rise of the internet, social media, and mobile technologies, and in the very near future the Internet of Things, means that our data footprint is growing fast. Companies like Google and Facebook were quickly confronted with massive data sets, which led to a new way of thinking about data. Hadoop provides an open source solution based on the same concepts as the technology used within Google. It allows you to store and analyze huge amounts of data in a scalable way to create new insights.

With this workshop we want to give everyone the opportunity to get acquainted with the Hadoop Ecosystem.

This course can be booked with exercises (2 days) or without (1 day).

Agenda

  • Introduction
  • What is Big Data?
    • Volume, Variety, Velocity
    • Business Drivers
    • Technical Drivers
    • Big Data Evolution
  • The Hadoop Ecosystem
    • What is Hadoop?
    • Hadoop Services
    • Hadoop Distributions
  • Storage
    • HDFS
    • HBase
    • Kudu
    • Data Modelling
  • Processing
    • MapReduce
    • Hive
    • Pig
    • Spark
    • YARN
  • Integration
    • Sqoop
    • Flume
    • Kafka
  • Indexing
    • Elasticsearch
  • Big Data Architectures
    • Architectures
      • Lambda
      • Kappa
      • Zeta
    • Trends