Confluent Kafka vs Apache Beam
Confluent Kafka and Apache Beam are both open-source technologies for streaming data, but they solve different problems.
Kafka is a distributed streaming platform used to store and move large amounts of data in real time. It is a good choice for applications that require high throughput and low latency, and it is fault-tolerant and horizontally scalable.
Apache Beam is a unified programming model for batch and streaming data processing. The same pipeline can run on a variety of runners, including Apache Spark, Apache Flink, and Google Cloud Dataflow. Beam is a good choice for applications that need to be portable across execution engines.
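To make Kafka's side concrete, here's a minimal sketch using the console tools that ship with a Kafka distribution (run from the distribution root; the topic name demo is just a placeholder):

# Create a topic, then stream messages through it from producer to consumer
bin/kafka-topics.sh --create --topic demo --bootstrap-server localhost:9092
bin/kafka-console-producer.sh --topic demo --bootstrap-server localhost:9092
bin/kafka-console-consumer.sh --topic demo --from-beginning --bootstrap-server localhost:9092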
AWS Lambda vs GCP Cloud Run
AWS Lambda and Google Cloud Run are both serverless computing platforms that allow you to run code without provisioning or managing servers. However, there are some key differences between the two platforms:
- Supported languages: AWS Lambda supports a range of runtimes including Node.js, Java, Python, Go, Ruby, and C# (.NET). Cloud Run runs container images, which can be built from any language.
- Cold start: When a function is invoked after sitting idle, the platform has to spin up an execution environment first. This is known as a cold start. On Lambda it typically takes anywhere from under a hundred milliseconds to a few seconds, depending on runtime and package size; on Cloud Run it depends mostly on container image size and application startup time.
- Concurrency: Lambda defaults to a quota of 1,000 concurrent executions per region, a soft limit you can ask to have raised. Cloud Run scales by adding container instances, each of which can serve many requests concurrently (80 by default).
- Pricing: AWS Lambda charges per request plus GB-seconds (memory allocated multiplied by execution time). Cloud Run charges for the vCPU and memory allocated while your container is handling requests, plus a per-request fee.
Feature | AWS Lambda | Google Cloud Run |
---|---|---|
Supported languages | Node.js, Java, Python, Go, Ruby, C# (.NET) | Container images (any language) |
Cold start | Under 100 ms to a few seconds, varies by runtime and package size | Depends on image size and app startup time |
Concurrency | 1,000 concurrent executions per region (default, raisable) | Scales per instance; 80 concurrent requests per instance by default |
Pricing | Per request plus GB-seconds of memory | vCPU and memory while serving requests, plus per-request fee |
I recommend trying both and seeing which one works better for you.
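The deployment models differ too: Lambda takes a packaged artifact tied to a managed runtime, while Cloud Run takes a container image. A minimal sketch of each, where the function/service names, the zip, the image and the IAM role are all hypothetical placeholders:

# AWS Lambda: upload a zip targeting a managed runtime
aws lambda create-function \
  --function-name my-fn \
  --runtime python3.12 \
  --handler app.handler \
  --zip-file fileb://app.zip \
  --role arn:aws:iam::123456789012:role/my-lambda-role

# Cloud Run: deploy any container image
gcloud run deploy my-service \
  --image gcr.io/my-project/my-image \
  --region europe-west1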
Cloud gotchas 2
Serverless
Serverless is great. You create your services, hand them over to AWS Lambda/GCP Cloud Run/Azure Functions and let them rip. Your system can scale up to hundreds of instances and quickly service your clients. However, you must consider:
- how will your downstream clients respond to such peaks in volume? Will they be able to cope? (One mitigation is to cap your function's concurrency; see the sketch after this list.)
- how much will auto-scaling cost?
- how portable is your code between serverless platforms?
- how will you handle bugs in the serverless platform itself? You can file a support ticket, but waiting on one is unlikely to go down well with your users.
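On the downstream-peaks point, one concrete mitigation on AWS is to reserve (and thereby cap) a function's concurrency so it cannot flood the systems behind it. A minimal sketch, assuming a hypothetical function called my-fn:

# Cap my-fn at 10 concurrent executions; further invocations are throttled
aws lambda put-function-concurrency \
  --function-name my-fn \
  --reserved-concurrent-executions 10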
Azure create K8s cluster
Here is a Terraform file that you can use to create a Kubernetes cluster in Azure:
provider "azurerm" {
version = "~> 3.70.0"
subscription_id = var.azure_subscription_id
client_id = var.azure_client_id
client_secret = var.azure_client_secret
tenant_id = var.azure_tenant_id
}
resource "azurerm_resource_group" "aks_cluster" {
name = var.resource_group_name
location = var.location
}
resource "azurerm_kubernetes_cluster" "aks_cluster" {
name = var.aks_cluster_name
location = azurerm_resource_group.aks_cluster.location
resource_group_name = azurerm_resource_group.aks_cluster.name
node_count = 3
vm_size = "Standard_D2s_v3"
network_profile {
kubernetes_network_interface_id = azurerm_network_interface.aks_cluster_nic.id
}
default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_D2s_v3"
}
}
resource "azurerm_network_interface" "aks_cluster_nic" {
name = var.aks_cluster_nic_name
location = var.location
resource_group_name = azurerm_resource_group.aks_cluster.name
ip_configuration {
name = "primary"
subnet_id = azurerm_subnet.aks_cluster_subnet.id
address_prefix = "10.0.0.0/24"
}
}
resource "azurerm_subnet" "aks_cluster_subnet" {
name = var.aks_cluster_subnet_name
resource_group_name = azurerm_resource_group.aks_cluster.name
virtual_network_name = var.virtual_network_name
address_prefix = "10.0.0.0/24"
}
resource "azurerm_virtual_network" "aks_cluster_vnet" {
name = var.virtual_network_name
location = var.location
resource_group_name = azurerm_resource_group.aks_cluster.name
address_space = ["10.0.0.0/16"]
}
This Terraform file creates a new Azure resource group, a virtual network and subnet, and an AKS cluster whose default node pool runs three Standard_D2s_v3 nodes attached to that subnet. The virtual network and subnet are created in the same region and resource group as the Kubernetes cluster.
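To apply it, the standard Terraform workflow works, assuming the var.* values above are declared and populated (e.g. via a terraform.tfvars):

# Download the azurerm provider, preview the changes, then create the resources
terraform init
terraform plan
terraform apply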
AWS vs Azure vs GCP
AWS, Azure, and GCP are the three leading cloud computing platforms in the market. They offer a wide range of services, including compute, storage, databases, networking, machine learning, and artificial intelligence.
Here are some of the key differences between the three platforms:
- Market share: AWS is the market leader, with a 33% market share in 2022. Azure is second with a 22% market share, and GCP is third with a 9% market share.
- Number of services: AWS offers the most services, with over 200. Azure offers over 100 services, and GCP offers over 60 services.
- Pricing: AWS is generally the most expensive platform, followed by Azure and GCP. However, AWS also offers the most flexible pricing options.
- Focus: AWS is known for its broad range of services. Azure is focused on enterprise customers and government agencies. GCP is focused on startups and developers.
- Innovation: AWS is known for its innovation, and it often introduces new services before its competitors. Azure and GCP are also investing in innovation, but they may not be as quick to market as AWS.
Ultimately, the best cloud computing platform for you will depend on your specific needs and requirements. If you need a wide range of services and are willing to pay a premium, then AWS is a good choice. If you are an enterprise customer or government agency, then Azure may be a better fit. And if you are a startup or developer, then GCP is a good option.
Cloud gotchas 1
Since 2017 I’ve been involved in a wide variety of “cloud” projects and there are some common myths I’ve observed.
Migrations are just containers
Change is hard and unless you’re working for a startup, most cloud transformations start as lift-and-shift exercises. Contracts have been signed and everyone has been sold the myth that all you need to do is “dockerise” your applications and away you go.
Unfortunately, most of the hyperscalers (cloud providers: GCP, AWS, Azure, etc.) will dazzle you with the way they’ve been doing things for years and will instruct you to “do as they say”. However, most regulated institutions face far stricter governance around things like Disaster Recovery and data locality. For example, on a recent project we discovered that a certain cloud provider had two data centres located less than 50 miles apart. This simply wasn’t good enough for the regulated entity: a natural disaster could easily wipe out both data centres. I was amazed.
Reverse engineering an existing GCP project with terraformer
It can be tough to reverse engineer an existing project that has never used Terraform. Terraformer can look at an existing project and generate the corresponding Terraform code for you. I tried it out on an existing legacy project which used Google Cloud Storage, BigQuery and various service accounts. The setup was a little tricky so I put together a script to simplify things. The script assumes you have gcloud set up (or a service account key/impersonation) and you may need to adjust the --resources parameter.
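My script isn’t reproduced here, but a minimal sketch of the workflow looks like this; the project id, region and resource list are placeholders, and it assumes terraformer and terraform are on your PATH with application-default credentials configured:

# terraformer needs an initialised working directory containing the google provider
mkdir -p tf-import && cd tf-import
cat > versions.tf <<EOF
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}
EOF
terraform init

# Generate Terraform code for the existing GCS, BigQuery and IAM resources
terraformer import google \
  --resources=gcs,bigQuery,iam \
  --projects=my-legacy-project \
  --regions=europe-west2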
Raspberry Pi/Raspbian - chromium/chromedriver crash after upgrade to 99.0.4844.51
Upgraded to chromedriver 99.0.4844.51 on Raspbian (bullseye) and seeing this in your chromedriver.log?
[0312/111354.689372:ERROR:egl_util.cc(74)] Failed to load GLES library: /usr/lib/chromium-browser/libGLESv2.so: /usr/lib/chromium-browser/libGLESv2.so: cannot open shared object file: No such file or directory
[0312/111354.709636:ERROR:viz_main_impl.cc(188)] Exiting GPU process due to errors during initialization
[0312/111354.735541:ERROR:gpu_init.cc(454)] Passthrough is not supported, GL is disabled, ANGLE is
Add "--disable-gpu" as an option when setting up the browser, e.g. for Selenium/Java:
ChromeOptions options = new ChromeOptions();
options.addArguments("--disable-gpu");
It looks like the behaviour has changed as this “shouldn’t” be required. More about flags here
Undelete a BigQuery table
One hour ago (the snapshot decorator takes a relative offset in negative milliseconds):
bq cp mydataset.table@-3600000 mydataset.table_restored
Or at an absolute point in time (ms since UNIX epoch), e.g. Wednesday, 26 May 2021 13:41:53 GMT = 1622036513000 (see https://www.epochconverter.com/):
bq cp mydataset.table@1622036513000 mydataset.table_restored
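As an aside, you can compute the absolute value without leaving the shell (GNU date, as on most Linux distros; the trailing 000 turns seconds into milliseconds):

# Prints 1622036513000
date -u -d '2021-05-26 13:41:53' +%s000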
More on BigQuery time travel