Planet Apache


Bryan Pendleton: Yes

Thu, 2017-12-14 22:55

Yes, yes, yes, yes, yes, yes.


Yes, yes, yes.

And, at the very end, the very last line, most definitely: yes.

Categories: FLOSS Project Planets

Bryan Pendleton: The Likeness: a very short review

Thu, 2017-12-14 21:54

The Likeness is the second in Tana French's Dublin Murder Squad series of mystery novels.

Although The Likeness is not quite as great as French's thoroughly superb first book, it is still quite good indeed, and I devoured it apace.

The characters are fascinating; the scenario is very intriguing; the pacing and reveal are just right.

But perhaps most importantly, French's wonderfully lyrical touch again does not fail her.

Here we are, mid-story, just as our hero is learning something new about a crucial character:

The garden dumbstruck, in the fading gold light. The birds hushed, the branches caught in midsway; the house, a great silence poised over us, listening. I had stopped breathing. Lexie blew down the grass like a silver shower of wind, she rocked in the hawthorn trees and balanced light as a leaf on the wall beside me, she slipped along my shoulder and blazed down my back like fox fire.

I love the way this passage depicts how "time stops" sometimes, when you suddenly realize something new.

I love the way this passage depicts the way that evidence can have a voice of its own, making inanimate artifacts come to life.

I love the way this passage evokes the spirit of a departed human soul, simultaneously here and not here.

And I love the beautiful way she makes us feel our own spine tingle.

There's plenty of good solid policework, of course. And plenty of action, and plenty of evidence, and plenty of mystery.

But there's a wonderful amount of this, too:

I listened to the static echoing in my ear and thought of those herds of horses you get in the vast wild spaces of America and Australia, the ones running free, fighting off bobcats or dingoes and living lean on what they find, gold and tangled in the fierce sun. My friend Alan from when I was a kid, he worked on a ranch in Wyoming one summer, on a J1 visa. He watched guys breaking those horses. He told me that every now and then there was one that couldn't be broken, one wild to the bone. Those horses fought the bridle and the fence till they were ripped up and streaming blood, till they smashed their legs or their necks to splinters, till they died of fighting to run.

Of course, she isn't really talking about horses at all.

I can't wait to read more of her books.


Justin Mason: Links for 2017-12-14

Thu, 2017-12-14 18:58

Bertrand Delacretaz: Open source is done. Welcome to Open Development!

Thu, 2017-12-14 03:30

I originally published this article on SD Times; I am republishing it here to keep it around for posterity…

If you’re looking at embracing open source today, you might be a bit late to the game. Using open-source software is mainstream now, and being involved in open-source projects is nothing to write home about either. Everybody does it, we know how it works, its value is proven.

But what’s next? Sharing source code openly is a given in open-source projects, but in the end it’s only about sharing lines of text. The real long-term power of successful open-source projects lies in how their communities operate, and that’s where open development comes in.

Shared communications channels. Meritocracy. Commit early, commit often. Back your work by issues in a shared tracker. Archive all discussions, decisions and issues about your project, and make that searchable. All simple principles that, when combined, make a huge difference to the efficiency of our corporate projects.

But, of course, the chaotic meritocracies of open-source projects won’t work for corporate teams, right? Such teams require a chain of command with strictly defined roles. Corporate meritocracy? You must be kidding.

I’m not kidding, actually: Open development works very well in corporate settings, and from my experience in both very small and fairly large organizations, much better than old-fashioned top-to-bottom chains of command and information segregation principles. Empower your engineers, trust them, make everything transparent so that mistakes can be caught early, and make sure the project’s flow of information is continuous and archived. Big rewards are just around the corner—if you dive in, that is.

What’s open development?
Open development starts by using shared digital channels to communicate between project members, as opposed to one-to-one e-mails and meetings. If your team’s e-mail clients are their knowledge base, that will go away with them when they leave, and it’s impossible for new project members to acquire that knowledge easily.

A centralized channel, like a mailing list, allows team members to be aware of everything that’s going on. A busy mailing list requires discipline, but the rewards are huge in terms of spreading knowledge around, avoiding duplicate work and providing a way for newcomers to get a feel for the project by reviewing the discussion archives. At the Apache Software Foundation, we even declare that “If it didn’t happen on the dev list, it didn’t happen,” which is a way of saying that whatever is worth saying must be made available to all team members. No more discrepancies in what information team members get; it’s all in there.

The next step is sharing all your code openly, all the time, with all stakeholders. Not just in a static way, but as a continuous flow of commits that can tell you how fast your software is evolving and where it’s going, in real time.

Software developers will sometimes tell you that they cannot show their code because it’s not finished. But code is never finished, and it’s not always beautiful, so who cares? Sharing code early and continuously brings huge benefits in terms of peer reviews, learning from others, and creating a sense of belonging among team members. It’s not “my” code anymore, it’s “our” code. I’m happy when someone sees a way to improve it and just does it, sometimes without even asking for permission, because the fix is obvious. One less bug, quality goes up, and “shared neurons in the air” as we sometimes say: all big benefits to a team’s efficiency and cohesion.

Openly sharing the descriptions and resolutions of issues is equally important and helps optimize usage of a team’s skills, especially in a crisis. As in a well-managed open-source project, every code change is backed by an issue in the tracker, so you end up with one Web page per issue, which tells the full history of why the change was made, how, when, and by whom. Invaluable information when you need to revisit the issue later, maybe much later when whoever wrote that code is gone.

Corporate projects too often skip this step because their developers are co-located and can just ask their colleague next door directly. By doing that, they lose an easy opportunity to create a living knowledgebase of their projects, without much effort from the developers. It’s not much work to write a few lines of explanation in an issue tracker when an issue is resolved, and, with good integration, rich links will be created between the issue tracker and the corresponding source code, creating a web of valuable information.

The dreaded “When can we ship?” question is also much easier to answer based on a dynamic list of specific issues and corresponding metadata than by asking around the office, or worse, having boring status meetings.

The last critical tool in our open development setup is in self-service archives of all that information. Archived mailing lists, resolved issues that stay around in the tracker, source-code control history, and log messages, once made searchable, make project knowledge available in self-service to all team members. Here as well, forget about access control and leave everything open. You want your engineers to be curious when they need to, and to find at least basic information about everything that’s going on by themselves, without having to bother their colleagues with meetings or tons of questions. Given sufficient self-service information, adding more competent people to a project does increase productivity, as people can largely get up to speed on their own.

While all this openness may seem chaotic and noisy to the corporate managers of yesterday, that’s how open-source projects work. The simple fact that loose groups of software developers with no common boss consistently produce some of the best software around should open your eyes. This works.


Aaron Morton: Should you use incremental repair?

Wed, 2017-12-13 19:00

After seeing a lot of questions surrounding incremental repair on the mailing list and after observing several outages caused by it, we figured it would be good to write down our advice in a blog post.

Repair in Apache Cassandra is a maintenance operation that restores data consistency throughout a cluster. It is advised to run repair operations at least once every gc_grace_seconds, so that tombstones get replicated consistently and zombie records are avoided if you perform DELETE statements on your tables.

Repair also facilitates recovery from outages that last longer than the hint window, or in case hints were dropped. For those operators already familiar with the repair concepts, there were a few back-to-basics moments when the behavior of repair changed significantly in the release of Apache Cassandra 2.2. The introduction of incremental repair as the default along with the generalization of anti-compaction created a whole new set of challenges.

How does repair work?

To perform repairs without comparing all data between all replicas, Apache Cassandra uses merkle trees to compare trees of hashed values instead.

During a repair, each replica will build a merkle tree, using what is called a “validation compaction”. It is basically a compaction without the write phase, the output being a tree of hashes.

Merkle trees are then compared between replicas to identify mismatching leaves, each leaf covering several partitions. No difference check is made on a per-partition basis: if one partition in a leaf is out of sync, then all partitions in that leaf are considered out of sync. Sending more data than is required is typically called overstreaming. Gigabytes of data can be streamed, even for one bit of difference. To mitigate overstreaming, people started performing subrange repairs, specifying start and end tokens to repair in smaller chunks, which results in fewer partitions per leaf.
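As a rough illustration of why leaf granularity matters, here is a minimal Python sketch. It is not Cassandra's implementation: it hashes contiguous token ranges into a flat list of leaves instead of building a real tree, and the toy token space, `leaf_index`, `leaf_hashes` and `tokens_to_stream` are all names assumed for this example.

```python
import hashlib

TOKEN_SPACE = 100  # toy token space, an assumption for illustration only


def leaf_index(token, num_leaves):
    """Map a token to the leaf covering its contiguous token range."""
    return token * num_leaves // TOKEN_SPACE


def leaf_hashes(replica, num_leaves):
    """Hash every partition of a replica into the leaf that covers its token."""
    hashers = [hashlib.sha256() for _ in range(num_leaves)]
    for token in sorted(replica):
        entry = f"{token}={replica[token]};".encode()
        hashers[leaf_index(token, num_leaves)].update(entry)
    return [h.hexdigest() for h in hashers]


def tokens_to_stream(replica_a, replica_b, num_leaves):
    """Return every token that sits in a leaf whose hashes differ."""
    hashes_a = leaf_hashes(replica_a, num_leaves)
    hashes_b = leaf_hashes(replica_b, num_leaves)
    mismatched = {i for i in range(num_leaves) if hashes_a[i] != hashes_b[i]}
    all_tokens = set(replica_a) | set(replica_b)
    return sorted(t for t in all_tokens if leaf_index(t, num_leaves) in mismatched)


# Two replicas that differ in exactly one partition (token 42).
replica_a = {token: "v1" for token in range(TOKEN_SPACE)}
replica_b = dict(replica_a)
replica_b[42] = "v2"

coarse = tokens_to_stream(replica_a, replica_b, num_leaves=4)
fine = tokens_to_stream(replica_a, replica_b, num_leaves=50)
print(len(coarse), len(fine))  # prints "25 2"
```

With four leaves, a single out-of-sync partition drags 25 partitions into the stream; with fifty leaves only two are sent. Subrange repair gets a similar effect by shrinking the token range that each tree has to cover.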

With clusters growing in size and density, performing repairs within gc_grace_seconds started to get more and more challenging, with repairs sometimes lasting for tens of days. Some clever folks leveraged the immutable nature of SSTables and introduced incremental repair in Apache Cassandra 2.1.

What is incremental repair?

The plan with incremental repair was that once some data had been repaired, it would be marked as such and would never need to be repaired again.
Since SSTables can contain tokens from multiple token ranges, and repair is performed by token range, it was necessary to be able to separate repaired data from unrepaired data. That process is called anticompaction.

Once a repair session ends, each repaired SSTable will be split into two SSTables: one that contains the data that was repaired in the session (i.e. data that belonged to the repaired token range) and another one with the remaining unrepaired data. The newly created SSTable containing repaired data will be marked as such by setting its repairedAt timestamp to the time of the repair session.
When performing validation compaction during the next incremental repair, Cassandra will skip the SSTables with a repairedAt timestamp higher than 0, and thus only compare data that is unrepaired.
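The split described above can be sketched in a few lines of Python. This is a conceptual model only, not Cassandra code: the `SSTable` class, `anticompact` and `tables_to_validate` are hypothetical names, and the start-exclusive, end-inclusive range is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    tokens: dict          # token -> value held by this SSTable
    repaired_at: int = 0  # 0 means "unrepaired"


def anticompact(sstable, start, end, session_time):
    """Split one SSTable into (repaired, unrepaired) for the range (start, end]."""
    inside = {t: v for t, v in sstable.tokens.items() if start < t <= end}
    outside = {t: v for t, v in sstable.tokens.items() if not (start < t <= end)}
    repaired = SSTable(inside, repaired_at=session_time)
    unrepaired = SSTable(outside, repaired_at=sstable.repaired_at)
    return repaired, unrepaired


def tables_to_validate(sstables):
    """The next incremental repair only looks at unrepaired SSTables."""
    return [s for s in sstables if s.repaired_at == 0]


table = SSTable({10: "a", 20: "b", 30: "c"})
repaired, unrepaired = anticompact(table, start=0, end=20, session_time=1513191600000)
remaining = tables_to_validate([repaired, unrepaired])
```

Here `repaired` holds tokens 10 and 20 with a non-zero timestamp, while `remaining` contains only the SSTable that still holds token 30.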

Incremental repair was actually promising enough that it was promoted to the default repair mode in C* 2.2, and since then anticompaction has also been performed during full repairs.
To say the least, this was a bit of a premature move from the community as incremental repair has a few very annoying drawbacks and caveats that would make us consider it an experimental feature instead.

The problems of incremental repair

The nastiest one is filed in the Apache Cassandra JIRA as CASSANDRA-9143, with a fix ready for the not-yet-scheduled 4.0 release. Between validation compaction and anticompaction, an SSTable that is involved in a repair can be compacted away as part of the standard compaction process on one node and not on the others. Such an SSTable will not get marked as repaired on that specific node, while the rest of the cluster will consider the data it contained as repaired.
Thus, on the next incremental repair run, all the partitions contained in that SSTable will be seen as inconsistent, which can generate a fairly large amount of overstreaming. This is a particularly nasty bug when incremental repair is used in conjunction with the Leveled Compaction Strategy (LCS). LCS is a very intensive strategy in which SSTables get compacted far more often than with STCS and TWCS. LCS creates fixed-size SSTables, which can easily lead to thousands of SSTables for a single table. Because of the way streaming occurs in Apache Cassandra during repair, overstreaming of LCS tables can create tens of thousands of small SSTables in L0, which can ultimately bring nodes down and affect the whole cluster. This is particularly true when the nodes use a large number of vnodes.
We have seen this happen on several customers' clusters, and it then requires a lot of operational expertise to bring the cluster back to a sane state.

In addition to the bugs related to incorrectly marked SSTables, anticompaction carries significant overhead. This came as a big surprise to users who upgraded from 2.0/2.1 to 2.2 and then tried to run repair. If there is already a lot of data on disk, the first incremental repair can take a lot of time (if not forever) and create a situation similar to the one above, with a lot of SSTables being created due to anticompaction. Keep in mind that anticompaction will rewrite all SSTables on disk to separate repaired and unrepaired data.
While it's no longer necessary to "prepare" the migration to incremental repair, we would strongly advise against running it on a cluster with a lot of unrepaired data without first marking SSTables as repaired. This would require running a full repair first to make sure data is actually repaired, but now even full repair performs anticompaction, so… you see the problem.

A safety measure has been put in place, for valid reasons, to prevent SSTables that are going through anticompaction from being compacted. The problem is that it also prevents those SSTables from going through validation compaction, which causes repair sessions to fail if an SSTable is being anticompacted. Given that anticompaction also occurs with full repairs, this creates the following limitation: you cannot run repair on more than one node at a time without risking failed sessions due to concurrency on SSTables. This is true for incremental repair but also for full repair, and it changes many of the habits you may have formed running repair in previous versions.

The only way to perform repair without anticompaction in “modern” versions of Apache Cassandra is subrange repair, which fully skips anticompaction. To perform a subrange repair correctly, you have three options :

Regardless, it is extremely important to note that repaired and unrepaired SSTables can never be compacted together. If you stop performing incremental repairs once you started, you could end up with outdated data not being cleaned up on disk due to the presence of the same partition in both states. So if you want to continue using incremental repair, make sure it runs very regularly, and if you want to move back to full subrange repairs you will need to mark all SSTables as unrepaired using sstablerepairedset.

Note that because subrange repair does not perform anticompaction, it is not possible to perform subrange repair in incremental mode.
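To make the idea of subrange repair concrete, the following Python sketch splits the token ring into equal slices and prints one full-repair command per slice. It rests on several assumptions: the cluster uses Murmur3Partitioner (tokens from -2^63 to 2^63 - 1), the keyspace name `my_keyspace` is a placeholder, and the whole ring is sliced naively, whereas a real tool would work from the ranges each node actually owns. The helper names `split_ring` and `repair_commands` are made up for the example.

```python
MIN_TOKEN = -(2 ** 63)    # lowest Murmur3Partitioner token
MAX_TOKEN = 2 ** 63 - 1   # highest Murmur3Partitioner token


def split_ring(num_ranges):
    """Cut the full token ring into contiguous (start, end) subranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // num_ranges for i in range(num_ranges + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_commands(keyspace, num_ranges):
    """Build one full subrange repair command per slice of the ring."""
    return [
        f"nodetool repair -full -st {start} -et {end} {keyspace}"
        for start, end in split_ring(num_ranges)
    ]


# "my_keyspace" is a placeholder keyspace name.
commands = repair_commands("my_keyspace", num_ranges=4)
for command in commands:
    print(command)
```

Smaller slices mean fewer partitions per Merkle tree leaf, at the cost of more repair sessions to schedule and monitor.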

Repair: state of the art in late 2017

Here’s our advice at the time of writing this blog post, based on our experience with customers: perform full subrange repair exclusively for now and do not ever run incremental repair. Just pretend that feature does not exist for now.

While the idea behind incremental repair is brilliant, the implementation still has flaws that can cause severe damage to a production cluster, especially when using LCS and DTCS. The improvements and fixes planned for 4.0 will need to be thoroughly tested to prove they fixed incremental repair and allow it to be safely used as a daily routine.

We are confident that future releases will make incremental repair better, allowing the operation to be safe and blazing fast compared to full repairs.


Justin Mason: Links for 2017-12-13

Wed, 2017-12-13 18:58

Colm O hEigeartaigh: A fast way to get membership counts in Apache Syncope

Wed, 2017-12-13 11:47
Apache Syncope is a powerful open source Identity Management project, covered extensively on this blog. Amongst many other features, it allows the management of three core types - Users, Groups and "Any Objects", the latter of which can be used to model arbitrary types. These core types can be accessed via a flexible REST API powered by Apache CXF. In this post we will explore the concept of "membership" in Apache Syncope, as well as a new feature added in Syncope 2.0.7 which provides an easy way to see membership counts.

1) Membership in Apache Syncope

Users and "Any Objects" can be members of Groups in two ways - statically and dynamically. "Static" membership is when the User or "Any Object" is explicitly assigned membership of a given Group. "Dynamic" membership is when the Group is defined with a set of rules; any User or "Any Object" for which those rules evaluate to true is a member of the group. For example, a User could be a dynamic member of a group based on the value of a given User attribute. So we could have an Apache group with a dynamic User membership rule of "*" matching an "email" attribute.
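The difference between the two kinds of membership can be pictured with a small Python sketch. This is purely conceptual and is not how Syncope evaluates its rules: the dictionary layout, the `is_member` helper and the `*@example.org` pattern are all assumptions made up for illustration.

```python
from fnmatch import fnmatch


def is_member(user, group):
    """Return True if the user is a static or dynamic member of the group."""
    # Static membership: the user was explicitly assigned to the group.
    if user["name"] in group["static_members"]:
        return True
    # Dynamic membership: every rule must match the user's attribute value.
    rules = group["dynamic_rules"]
    return bool(rules) and all(
        fnmatch(user["attributes"].get(attribute, ""), pattern)
        for attribute, pattern in rules.items()
    )


# Hypothetical group: one static member plus a wildcard rule on "email".
group = {
    "name": "apache",
    "static_members": {"alice"},
    "dynamic_rules": {"email": "*@example.org"},
}
alice = {"name": "alice", "attributes": {"email": "alice@elsewhere.test"}}
bob = {"name": "bob", "attributes": {"email": "bob@example.org"}}
carol = {"name": "carol", "attributes": {"email": "carol@elsewhere.test"}}
```

In this sketch "alice" is a member because she was assigned explicitly, "bob" is a member because his email matches the rule, and "carol" is not a member at all.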

2) Exploring group membership via the REST API

Let's examine group membership with some practical examples. Start Apache Syncope and log in to the admin console. Click on "Groups" and add a new group called "employee", accepting the default options. Now click on the "User" tab and add new Users called "alice" and "bob", with static membership of the "employee" group.

Using a tool like "curl", we can access the REST API using the admin credentials to obtain information on "alice":
  • curl -u admin:password http://localhost:9080/syncope/rest/users/alice
Note that "alice" has a "memberships" attribute pointing to the "employee" group. Next we can see information on the "employee" group via:
  • curl -u admin:password http://localhost:9080/syncope/rest/groups/employee
3) Obtaining membership counts

Now consider obtaining the membership count of a given group. Let's say we are interested in finding out how many employees we have - how can this be done? Prior to Apache Syncope 2.0.7, we had to leverage the power of FIQL, which underpins the search capabilities of the REST API of Apache Syncope:
  • curl -u admin:password "http://localhost:9080/syncope/rest/users?fiql=%24groups==employee"
In other words, search for all Users who are members of the "employee" group. This returns a long list of all Users, even though all we care about is the count (which is encoded in the "totalCount" attribute). There is a new way to do this in Apache Syncope 2.0.7. Instead of having to search for Users, membership counts are now encoded in groups. So we can see the total membership counts for a given group just by doing a GET call:
  • curl -u admin:password http://localhost:9080/syncope/rest/groups/employee
Following the example above, you should see a "staticUserMembershipCount" attribute with a value of "2". Four new attributes are defined for GroupTO:
  • staticUserMembershipCount: The static user membership count of a given group
  • dynamicUserMembershipCount: The dynamic user membership count of a given group
  • staticAnyObjectMembershipCount: The static "Any Object" membership count of a given group
  • dynamicAnyObjectMembershipCount: The dynamic "Any Object" membership count of a given group.
Some consideration was given to returning the Any Object counts associated with a given Any Object type, but this was abandoned due to performance reasons.
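
For scripting purposes, both approaches might look like the following Python sketch. The base URL and attribute names are the ones used above; the `fiql_search_url` and `membership_counts` helpers are hypothetical, and the JSON sample is a made-up, heavily trimmed stand-in for a real group response, so that the snippet runs without a Syncope instance.

```python
import json
from urllib.parse import quote

BASE_URL = "http://localhost:9080/syncope/rest"  # same host and port as the curl calls

COUNT_FIELDS = (
    "staticUserMembershipCount",
    "dynamicUserMembershipCount",
    "staticAnyObjectMembershipCount",
    "dynamicAnyObjectMembershipCount",
)


def fiql_search_url(group):
    """Pre-2.0.7 approach: search for all Users who are members of the group."""
    fiql = quote("$groups==" + group, safe="=")  # "$" becomes "%24"
    return f"{BASE_URL}/users?fiql={fiql}"


def membership_counts(group_json):
    """2.0.7 approach: read the four count attributes from a group response."""
    group = json.loads(group_json)
    return {field: group.get(field, 0) for field in COUNT_FIELDS}


# Made-up, trimmed response body; a real group response carries more attributes.
sample = """{
  "name": "employee",
  "staticUserMembershipCount": 2,
  "dynamicUserMembershipCount": 0,
  "staticAnyObjectMembershipCount": 0,
  "dynamicAnyObjectMembershipCount": 0
}"""

url = fiql_search_url("employee")
counts = membership_counts(sample)
```

The second helper only needs the single group document, which is the point of the new attributes: no User search has to be run just to obtain a number.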

Bryan Pendleton: A winter adventure in Zion

Tue, 2017-12-12 22:19

It was time to go, so we got up and went. We packed our bags, with as much warm clothing as we could reasonably carry, flew to Las Vegas, picked up a nice new rental car (what a nice car the new Toyota Camry is!), and drove northeast on interstate 15.

About 45 minutes out of Vegas, we took a back road recommended by my wife's colleague, which took us about 15 miles off the freeway, into a Nevada State Park named Valley of Fire. After a short stop at the Visitor Center to get our bearings, we found the picnic area named Mouse's Tank, populated by the boldest little ground squirrels you could imagine, practicing their cutest poses to try to convince us to donate some of our lunch to them.

After lunch, we took a short walk to admire the Valley of Fire petroglyphs, which are nothing short of astounding.

Then we were back in the car again, and soon back on I-15, and not long after that we were through Nevada, and had sliced a corner off of Arizona, and were solidly into Utah, before we left the freeway to take Utah State Route 9 east into Zion National Park.

The days are short, this time of year, so even though we got to the park gates at about 5:45 PM, it was already pitch dark, and we crept along the park road quite slowly, peeking around every corner for deer, trying to figure out where our turn was. Zion Lodge is located most of the way up the canyon road, deep in the main canyon, enjoying a location that can stand toe-to-toe with any hotel on the planet for claim to "Most Beautiful Lodge Location".

But, as I say, it was completely dark out, and we were exhausted, so we simply checked into our room (which was wonderful: spacious and elegant), had dinner at the lodge restaurant, and collapsed into bed.

Deep in the main canyon, sunset comes early and sunrise late, particularly this time of year. But up we got, the next morning, and bravely we set out to explore Zion National Park. Lo and behold, as the sun started to crawl slowly down the western walls of the canyon toward the valley floor, we found ourselves nearly alone in a place of tremendous beauty, with nearly as many mule deer as human visitors keeping us company on our explorations.

At the very end of the canyon road, one of the most famous trails is the Riverside Walk, which leads into the section of the Virgin River canyon known as The Narrows, launching spot for those interested in the sport of Canyoneering. We could barely imagine this, for at the time we walked the trail the temperature was 34 degrees, and a steady breeze was blowing, so we were fully encased in every shred of clothing we could layer upon ourselves, but at the trail's end there were nearly a dozen people, of all ages, clad in little more than long-sleeved swimsuits, waterproof hiking boots, and gaiters, setting off confidently into the rapidly-flowing, near-freezing waters of the Virgin River, headed upstream for adventure.

We had decided to work our way, slowly, back down the main canyon, and so we did, stopping to hike the Weeping Rock trail, the Emerald Pools trail, and the Watchman trail, among others, as well as stopping along the road for half an hour or so to watch people hiking up Walter's Wiggles (as well as rock climbing the cliff face below the Angels Landing trail).

No, we didn't do the Angels Landing trail. Yes, it's true: I ruled it out from the start. Uh, here's the reason why.

By the end of our first day, we were well and thoroughly exhausted, but also extremely pleased with the day.

There's just nothing like the experience of spending an entire day in a National Park: waking up in the park, spending all day in and around the park, and then remaining in the park when all the daily visitors go home, and it's just you lucky few. And the mule deer.

Once again we woke up the next morning in complete darkness, and made our way over to the lodge for breakfast, with aching muscles yet still aching for more.

Zion National Park is fairly large, even though compared to some national parks it's not gigantic, and I was hungry to see as much of the park as I could.

So we popped into the car and drove up Kolob Terrace Road, which in barely a dozen miles took us up from the 3,500 foot river elevation to the 7,000 foot elevation of Upper Kolob Terrace.

We were well-prepared: we had brought our lunch, and, as it turns out, we had brought the right clothing, for by the time we reached the Northgate Peaks trail it was already in the low 50's, and by the time we reached trail's end it was in the low 60's. Sunny skies, perfect temperatures, no bugs, and a nearly-level 2 mile hike to an amazing canyon viewpoint: is there any better way to spend a day in the mountains?

On our way back down, we stopped at Hoodoo City and tried to follow the trail over to see the peculiar rock formations, but it was slow, sandy going, and the closer we got to the rocks, the more they seemed to fade into the distance. Our decision was made for us when we met a couple returning from the trail who told us they were pretty sure they'd heard a mountain lion growling just a few dozen yards from the trail.

So back down the hill we went, and decided to settle for a yummy dinner at the local brewpub.

All good things must come to an end, and it was time to return to civilization, so we got a good early start on our final day in the mountains and made a short stop at the third part of Zion National Park which is easily accessible: Kolob Canyons. Happily, we had just enough time to drive up to the end of the road to take in the truly remarkable views. The views from the roadside parking lot are superb; the views from the end of the Timber Creek Overlook trail are even better.

Back down I-15 to Las Vegas we went. My mother, who knows a lot about this part of the world, swears that U.S. 395 along the Eastern Sierra is the most beautiful road in the 48 states, and she's got a fine case, but I think that the stretch of I-15 from Las Vegas, Nevada to Cedar City, Utah is a serious contender, particularly on a clear winter's day when the view goes on forever (well, at least 50 miles).

It was as nice a way as one could ask to end as nice a weekend as one could hope for.

If you ever get a chance to visit Zion National Park in winter, take it.


Justin Mason: Links for 2017-12-12

Tue, 2017-12-12 18:58
  • The Case for Learned Index Structures

    ‘Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.’ Excellent follow-up thread from Henry Robinson: ‘The fact that the learned representation is more compact is very neat. But also it’s not really a surprise that, given the entire dataset, we can construct a more compact function than a B-tree which is *designed* to support efficient updates.’ […] ‘given that the model performs best when trained on the whole data set – I strongly doubt B-trees are the best we can do with the current state-of-the art.’

    (tags: data-structures ml google b-trees storage indexes deep-learning henry-robinson)

  • Internet protocols are changing

    per @mnot. HTTP/2; TLS 1.3; QUIC and UDP; and DOH (DNS over HTTP!)

    (tags: crypto encryption http https protocols http2 tls quic udp tcp dns tunnelling)


Justin Mason: Links for 2017-12-11

Mon, 2017-12-11 18:58

Colm O hEigeartaigh: SAML SSO support for the Apache Syncope web console

Fri, 2017-12-08 12:09
Apache Syncope is a powerful open source Identity Management project that has recently celebrated 5 years as an Apache top level project. Until recently, a username and password had to be supplied to log onto either the admin or enduser web consoles of Apache Syncope. However, SAML SSO login has been supported since the 2.0.3 release. Instead of supplying a username/password, the user is redirected to a third party IdP for login, before being redirected back to the Apache Syncope web console. In 2.0.5, support for the IdP-initiated flow of SAML SSO was added.

In this post we will show how to configure Apache Syncope to use SAML SSO as an alternative to logging in using a username and password. We will use Apache CXF Fediz as the SAML SSO IdP. In addition, we will show how to achieve IdP-initiated SSO using Okta. Please also refer to this tutorial on achieving SAML SSO with Syncope and Shibboleth.

1) Logging in to Apache Syncope using SAML SSO

In this section, we will cover setting up Apache Syncope to re-direct to a third party IdP so that the user can enter their credentials. The next section will cover the IdP-initiated case.

1.a) Enable SAML SSO support in Apache Syncope

First we will configure Apache Syncope to enable SAML SSO support. Download and extract the most recent standalone distribution release of Apache Syncope (2.0.6 was used in this post). Start the embedded Apache Tomcat instance and then open a web browser and navigate to "http://localhost:9080/syncope-console", logging in as "admin" and "password".

Apache Syncope is configured with some sample data to show how it can be used. Click on "Users" and add a new user called "alice" by clicking on the subsequent "+" button. Specify a password for "alice" and then select the default values wherever possible (you will need to specify some required attributes, such as "surname"). Now in the left-hand column, click on "Extensions" and then "SAML 2.0 SP". Click on the "Service Provider" tab and then "Metadata". Save the resulting Metadata document, as it will be required to set up the SAML SSO IdP.

1.b) Set up the Apache CXF Fediz SAML SSO IdP

Next we will turn our attention to setting up the Apache CXF Fediz SAML SSO IdP. Download the most recent source release of Apache CXF Fediz (1.4.3 was used for this tutorial). Unzip the release and build it using maven ("mvn clean install -DskipTests"). In the meantime, download and extract the latest Apache Tomcat 8.5.x distribution (tested with 8.5.24). Once Fediz has finished building, copy all of the "IdP" wars (e.g. in fediz-1.4.3/apache-fediz/target/apache-fediz-1.4.3/apache-fediz-1.4.3/idp/war/fediz-*) to the Tomcat "webapps" directory.

There are a few configuration changes to be made to Apache Tomcat before starting it. Download the HSQLDB jar and copy it to the Tomcat "lib" directory. Next edit 'conf/server.xml' and configure TLS on port 8443:

The two keys referenced here can be obtained from 'apache-fediz/target/apache-fediz-1.4.3/apache-fediz-1.4.3/examples/samplekeys/' and should be copied to the root directory of Apache Tomcat. Tomcat can now be started.

Next we have to configure Apache CXF Fediz to support Apache Syncope as a "service" via SAML SSO. Edit 'webapps/fediz-idp/WEB-INF/classes/entities-realma.xml' and add the following configuration:

In addition, we need to make some changes to the "idp-realmA" bean in this file:
  • Add a reference to this bean in the "applications" list: <ref bean="srv-syncope" />
  • Change the "idpUrl" property to: https://localhost:8443/fediz-idp/saml
  • Change the port for "stsUrl" from "9443" to "8443".
Now we need to configure Fediz to accept Syncope's signing cert. Edit the Metadata file you saved from Syncope in step 1.a. Copy the Base-64 encoded certificate in the "KeyDescriptor" section, and paste it (including line breaks) into 'webapps/fediz-idp/WEB-INF/classes/syncope.cert', enclosing it in between "-----BEGIN CERTIFICATE-----" and "-----END CERTIFICATE-----".
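If you would rather script the certificate step than edit the file by hand, the wrapping can be sketched as below. This is only an illustration: "toPem" is a hypothetical helper name, and re-wrapping the Base-64 text at 64 characters per line is an assumption based on the usual PEM layout, not something Fediz is known to require.

```javascript
// Hypothetical helper: wrap a Base-64 encoded certificate body in PEM armor.
function toPem(base64Body) {
  // drop any whitespace picked up when copying out of the metadata XML
  var compact = base64Body.replace(/\s+/g, "");

  // re-wrap at 64 characters per line (assumed conventional PEM width)
  var lines = compact.match(/.{1,64}/g) || [];

  return ["-----BEGIN CERTIFICATE-----"]
    .concat(lines, ["-----END CERTIFICATE-----"])
    .join("\n") + "\n";
}
```

The returned string is what would be saved as 'syncope.cert'.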

Now restart Apache Tomcat. Open a browser and save the Fediz metadata which is available at "http://localhost:8080/fediz-idp/metadata?protocol=saml", which we will require when configuring Apache Syncope.

1.c) Configure the Apache CXF Fediz IdP in Syncope

The final configuration step takes place in Apache Syncope again. In the "SAML 2.0 SP" configuration screen, click on the "Identity Providers" tab, click the "+" button, and select the Fediz metadata that you saved in the previous step. Now log out, and an additional login option can be seen on the login screen.

Select the URL for the SAML SSO IdP and you will be redirected to Fediz. Select the IdP in realm "A" as the home realm and enter credentials of "alice/ecila" when prompted. You will be successfully authenticated to Fediz and redirected back to the Syncope admin console, where you will be logged in as the user "alice". 

2) Using IdP-initiated SAML SSO

Instead of the user starting with the Syncope web console, being redirected to the IdP for authentication, and then being redirected back to Syncope, it is also possible to start from the IdP. In this section we will show how to configure Apache Syncope to support IdP-initiated SAML SSO using Okta.

2.a) Configuring a SAML application in Okta

The first step is to create an account at Okta and configure a SAML application. This process is mapped out at the following link. Follow the steps listed on this page with the following additional changes:
  • Specify the following for the Single Sign On URL: http://localhost:9080/syncope-console/saml2sp/assertion-consumer
  • Specify the following for the audience URL: http://localhost:9080/syncope-console/
  • Specify the following for the default RelayState: idpInitiated
When the application is configured, you will see an option to "View Setup Instructions". Open this link in a new tab and find the section about the IdP Metadata. Save this to a local file and set it aside for the moment. Next you need to assign the application to the username that you have created at Okta.

2.b) Configure Apache Syncope to support IdP-Initiated SAML SSO

Log on to the Apache Syncope admin console using the admin credentials, and add a new IdP Provider in the SAML 2.0 SP extension as before, using the Okta metadata file that you have saved in the previous section. Edit the metadata and select the 'Support Unsolicited Logins' checkbox. Save the metadata and make sure that the Okta user is also a valid user in Apache Syncope.

Now go back to the Okta console and click on the application you have configured for Apache Syncope. You should be seamlessly logged in to the Apache Syncope admin console.

Bryan Pendleton: In the Valley of Gods

Thu, 2017-12-07 23:51

Oh boy!

Oh boy oh boy oh boy oh boy oh boy!!!

Campo Santo return!

Campo Santo, makers of the astonishingly great Firewatch (you know, the game with that ending), have started to reveal some of the information about their next game: In the Valley of Gods.

In the Valley of Gods is a single-player first person video game set in Egypt in the 1920s. You play as an explorer and filmmaker who, along with your old partner, has traveled to the middle of the desert in the hopes of making a seemingly-impossible discovery and an incredible film.

Here's the In the Valley of Gods "reveal trailer".

Looking forward to 2019 already!

Ortwin Glück: [Code] Gentoo enables PIE

Thu, 2017-12-07 05:59
Gentoo has new profiles that require you to "recompile everything". That is technically not really necessary. Only static libraries really need recompiling.

Here is why:
A static library is just an archive of .o files (similar to tar), nothing more, and linking against a static library is roughly the same as just adding more .o files to the linker line. You can also link a static library into a shared library - the code in the static library is then just copied into the shared library (but the code then must be compiled with -fPIC, as with all other code that is used in shared libraries).

You can find static libs like so:
    equery b $(find /usr/lib/ /lib/ -name '*.a') | awk '{ print $1; }' | sort | uniq
Typically this yields packages like elfutils, libbsd, nss, iproute2, keyutils, texinfo, flex, db, numactl.

Bryan Pendleton: Another milestone in computer chess

Wed, 2017-12-06 23:43

This just in from the DeepMind team: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

The AlphaZero algorithm is a more generic version of the AlphaGo Zero algorithm that was first introduced in the context of Go (29). It replaces the handcrafted knowledge and domain-specific augmentations used in traditional game-playing programs with deep neural networks and a tabula rasa reinforcement learning algorithm.


AlphaZero convincingly defeated all opponents, losing zero games to Stockfish and eight games to Elmo (see Supplementary Material for several example games), as well as defeating the previous version of AlphaGo Zero.


we analysed the chess knowledge discovered by AlphaZero. Table 2 analyses the most common human openings (those played more than 100,000 times in an online database of human chess games (1)). Each of these openings is independently discovered and played frequently by AlphaZero during self-play training. When starting from each human opening, AlphaZero convincingly defeated Stockfish, suggesting that it has indeed mastered a wide spectrum of chess play.

As for myself, I seem to hang pieces more frequently than I did a decade ago.

But I still love chess.

And, in that part of the world not (yet) inhabited solely by deep neural networks, That Norwegian Genius is going to play again, in London, next November: London Will Host FIDE World Chess Championship Match 2018.

Justin Mason: Links for 2017-12-06

Wed, 2017-12-06 18:58
  • In first, 3-D printed objects connect to WiFi without electronics

    This. is. magic.

    Physical motion—pushing a button, laundry soap flowing out of a bottle, turning a knob, removing a hammer from a weighted tool bench—triggers gears and springs elsewhere in the 3-D printed object that cause a conductive switch to intermittently connect or disconnect with the antenna and change its reflective state. Information—in the form of 1s and 0s—is encoded by the presence or absence of the tooth on a gear. Energy from a coiled spring drives the gear system, and the width and pattern of gear teeth control how long the backscatter switch makes contact with the antenna, creating patterns of reflected signals that can be decoded by a WiFi receiver.

    (tags: magic wifi whoa 3d-printing objects plastic gears springs)

Sam Ruby: Achieving Response Time Goals with Service Workers

Wed, 2017-12-06 16:23

Service Workers enable a web application to be responsive even if the network isn't. Frameworks like AngularJS, React and Vue.js enable web applications to efficiently update and render web pages as data changes.

The Apache Software Foundation's Whimsy board agenda application uses both in combination to achieve a responsive user experience - both in terms of quick responses to user requests and quick updates based on changes made on the server.

From a performance perspective, the two cases easiest to optimize for are (1) the server fully up and running accessed across a fast network with all possible items cached, and (2) the application fully offline as once you make offline possible at all, it will be fast.

The harder cases are ones where the server has received a significant update and needs to get that information to users, and even harder is when the server has no instances running and needs to spin up a new instance to process a request. While it is possible to do blue/green deployment for applications that are "always on", this isn't practical or appropriate for applications which are only used in periodic bursts. The board agenda tool is one such application.

This article describes how a goal of sub-second response time is achieved in such an environment. There are plenty of articles on the web that show snippets or sanitized approaches; this one focuses on real world usage.

Introduction to Service Workers

Service Workers are JavaScript files that can intercept and provide responses to navigation and resource requests. Service Workers are supported today by Chrome and Firefox, and are under development in Microsoft Edge and WebKit/Safari.

Service Workers are part of a larger effort dubbed "Progressive Web Apps" that aims to make web applications reliable and fast, no matter what the state of the network may happen to be. The word "progressive" in this name is there to indicate that these applications will work with any browser to the best of that browser's ability.

The signature or premier feature of Service Workers is offline applications. Such web applications are loaded normally the first time, and cached. When offline, requests are served by the cache, and any input made by users can be stored in local storage or in IndexedDB. Resources such as The Offline Cookbook provide a number of recipes that can be used.

Overview of the Board Agenda Tool

This information is for background purposes only. Feel free to skim or skip.

The ASF Board meets monthly, and minutes are published publicly on the web. A typical meeting has over one hundred agenda items, though the board agenda tool assists in resolving most of them offline, leaving a manageable 9 officer reports, around 20 PMC reports that may or may not require action, and a handful of special orders.

While the full agenda is several thousand lines long, this file size is only a quarter of a megabyte or the size of a small image. The server side of this application parses the agenda and presents it to the client in JSON format, and the result is roughly the same size as the original.

To optimize the response of the first page access, the server is structured to do server side rendering of the page that is requested, and the resulting response starts with links to stylesheets, then contains the rendered HTML, and finally any scripts and data needed. This allows the browser to incrementally render the page as it is received. This set of scripts includes a script that can render any page (or component) that the board agenda tool can produce, and the data includes all the information necessary to do so. The current implementation is based on Vue.js.

Once loaded, traversal between pages is immeasurably quick. By that I mean that you can go to the first page and lean on the right arrow button, and the pages will smoothly scroll by at roughly the rate at which you can see the faces in a deck of cards shuffled upside down.

The pages generally contain buttons and hidden forms; which buttons appear often depends on the user who requests the page. For example, only Directors will see approve and unapprove buttons; and individual directors will only see one of these two buttons based on whether or not they have already approved the report.

A WebSocket between the server and client is made mostly so the server can push changes to each client; changes that then cause re-rendering and updated displays. Requests from the client to the server generally are done via XMLHttpRequest as it wasn't until very recently that Safari supported fetch. IE still doesn't, but Edge does.

Total (uncompressed) size of the application script is another quarter of a megabyte, and dependencies include Vue.js and Bootstrap, the latter being the biggest, requiring over half a megabyte of minimized CSS.

All scripts and stylesheets are served with a Cache-Control: immutable header as well as an expiration date a year from when the request was made. This is made possible by the expedient of utilizing a cache busting query string that contains the last modified date. Etag and 304 responses are also supported.
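As an illustration only, a cache busting URL of this kind could be produced as sketched below. The helper name "cacheBustedUrl" and the choice of seconds since the epoch for the query string are assumptions; the article says only that the query string contains the last modified date.

```javascript
// Hypothetical helper: append the file's last-modified time to its URL so
// that a changed file gets a new URL and the old immutable copy is bypassed.
function cacheBustedUrl(path, lastModified) {
  // seconds since the epoch (an assumed encoding of the last modified date)
  return path + "?" + Math.floor(lastModified.getTime() / 1000);
}

// Example: a script last modified on 2017-12-06T16:00:00Z
var url = cacheBustedUrl("app.js", new Date(Date.UTC(2017, 11, 6, 16, 0, 0)));
// url is "app.js?1512576000"
```

Because the URL changes whenever the file does, a browser can safely keep the old response for a year without ever serving stale code.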

Offline support was added recently. Updates made when offline are stored in IndexedDB and sent as a batch when the user returns online. Having all of the code and data to render any page made this support very straightforward.

Performance observations (pre-optimization)

As mentioned at the top of this article, offline operations are virtually instantaneous. Generally, immeasurably so. As described above, this also applies to transitions between pages.

This leaves the initial visit and returning visits; the latter includes opening the application in new tabs.

Best case response times for these cases are about a second. This may be due to the way that server side rendering is done or perhaps due to the fact that each page is customized to the individual. Improving on this is not a current priority, though the solution described later in this article addresses this.

Worst case response times are when there are no active server processes and all caches (both server side and client side) are either empty or stale. It is hard to get precise numbers for this, but it is on the order of eight to ten seconds. Somewhere around four is the starting of the server. Building the JSON form of the agenda can take another two given all of the validation (involving things like LDAP queries) involved in the process. Regenerating the ES5 JavaScript from sources can take another second or so. Producing the custom rendered HTML is another second. And then there is all of the client side processing.
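To make the arithmetic explicit, the rough server side figures above can be tallied as follows. The property names are made up for this sketch, and the numbers are the approximate ones quoted in the paragraph, not measurements.

```javascript
// Approximate server side costs quoted above, in seconds.
var worstCase = {
  serverStart: 4,          // starting the server
  buildAgendaJson: 2,      // building the JSON form of the agenda
  regenerateJavaScript: 1, // regenerating the ES5 JavaScript
  renderHtml: 1            // producing the custom rendered HTML
};

var serverSeconds = Object.keys(worstCase).reduce(function(sum, step) {
  return sum + worstCase[step];
}, 0);
// serverSeconds is 8; client side processing comes on top of that
```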

In all, probably just under ten seconds if the server is otherwise idle. It can be a little more if the server is under moderate to heavy load.

The worst parts of this:

  1. No change is seen on the browser window until the last second or so.
  2. While the worst case scenario is comparatively rare in production, it virtually precisely matches what happens in development.

Selecting an approach

Given that the application can be brought up quickly in an entirely offline mode, one possibility would be to show the last cached status and then request updated information and process that information when received. This approach works well if the only change is to agenda data, but doesn't work so well in production whenever a script change is involved.

This can be solved with a window.location.reload() call, which is described (and somewhat discouraged) as approach #2 in Dan Fabulich's "How to Fix the Refresh Button When Using Service Workers". Note the code below was written before Dan's page was published, but in any case, Dan accurately describes the issue.

Taking some measurements on this produces interesting results. What is needed to determine if a script or stylesheet has changed is a current inventory from the server. This can consistently be provided quickly and is independent of the user requesting the data, so it can be cached. But since the data size is small enough, caching (in the sense of HTTP 304 responses) isn't all that helpful.

Response time for this request in realistic network conditions when there is an available server process is around 200 milliseconds, and doesn't tend to vary very much.

The good news is that this completely addresses the "reload flash" problem.

Unfortunately, the key words here are "available server process" as that was the original problem to solve.

Fortunately, a combination approach is possible:

  1. Attempt to fetch the inventory page from the network, but give it a deadline that it should generally beat. Say, 500 milliseconds or a half a second.
  2. If the deadline isn't met, load potentially stale data from the cache, and request newer data. Once the network response is received (which had a 500 millisecond head start), determine if any scripts or stylesheets changed. If not, we are done.
  3. Only if the deadline wasn't met AND there was a change to a stylesheet or more commonly a script, perform a reload; and figure out a way to address the poor user experience associated with a reload.
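
The three steps above can be sketched in isolation as follows. This is a simplified illustration rather than the board agenda tool's code: "respondWithDeadline", "fetchFromNetwork" and "readFromCache" are hypothetical names, the two callbacks are assumed to return Promises, and the reload decision in step 3 is left to the caller.

```javascript
// Sketch: prefer the network, but fall back to the cache after a deadline.
function respondWithDeadline(fetchFromNetwork, readFromCache, deadlineMs) {
  return new Promise(function(resolve, reject) {
    var settled = false;

    // step 2: if the deadline passes first, answer from the cache
    var timer = setTimeout(function() {
      readFromCache().then(function(cached) {
        if (!settled && cached !== undefined) {
          settled = true;
          resolve({source: "cache", body: cached});
        }
      });
    }, deadlineMs);

    // step 1: try the network
    fetchFromNetwork().then(function(fresh) {
      clearTimeout(timer);
      if (!settled) {
        settled = true;
        resolve({source: "network", body: fresh});
      }
      // step 3 (not shown): if the cache already answered, compare the
      // fresh response with the cached one and decide whether to reload
    }, function(error) {
      // network failure: use the cache if it has anything at all
      clearTimeout(timer);
      readFromCache().then(function(cached) {
        if (settled) return;
        settled = true;
        if (cached !== undefined) resolve({source: "cache", body: cached});
        else reject(error);
      });
    });
  });
}
```

The caller learns from the "source" field whether the user is looking at potentially stale data, which is the only case where a later reload might be needed.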

Additional exploration led to the solution where the inventory page mentioned above could be formatted in HTML and, in fact, be the equivalent to a blank agenda page. Such a page would still be less than 2K bytes, and performance would be equivalent to loading a blank page and then navigating to the desired page, in other words, immeasurably fast.


If you look at existing recipes, Network or Cache is pretty close; the problem is that it leaves the user with stale data if the network is slow. It can be improved upon.

Starting with the fetch from the network:

    // attempt to fetch bootstrap.html from the network
    fetch(request).then(function(response) {
      // cache the response if OK, fulfill the response if not timed out
      if (response.ok) {
        cache.put(request, response.clone());

        // preload stylesheets and javascripts
        if (/bootstrap\.html$/.test(request.url)) {
          response.clone().text().then(function(text) {
            var toolate = !timeoutId;

            setTimeout(
              function() {
                preload(cache, request.url, text, toolate)
              },

              (toolate ? 0 : 3000)
            )
          })
        };

        if (timeoutId) {
          clearTimeout(timeoutId);
          resolve(response)
        }
      } else {
        // bad response: use cache instead
        replyFromCache(true)
      }
    }).catch(function(failure) {
      // no response: use cache instead
      replyFromCache(true)
    })

This code needs to be wrapped in a Promise that provides a resolve function, and needs access to a cache as well as a variable named timeoutId that determines whether or not the response has timed out.

If the response is ok, it will be cached and a preload method will be called to load resources mentioned in the page. That will be done immediately if toolate is set (the timer has already expired and the cached page has been served), or otherwise after a short delay to allow updates to be processed. Finally, if such a response was received in time, the timer will be cleared, and the promise will be resolved.

If either a bad response or no response was received (typically, this represents a network failure), the cache will be used instead.

Next the logic to reply from the cache:

    // common logic to reply from cache
    var replyFromCache = function(refetch) {
      return cache.match(request).then(function(response) {
        clearTimeout(timeoutId);

        if (response) {
          resolve(response);
          timeoutId = null
        } else if (refetch) {
          fetch(event.request).then(resolve, reject)
        }
      })
    };

    // respond from cache if the server isn't fast enough
    timeoutId = setTimeout(function() {replyFromCache(false)}, timeout);

This code looks for a cache match, and if it finds one, it will resolve the response, and clear the timeoutId enabling the fetch code to detect if it was too late.

If no response is found, the action taken will be determined by the refetch argument. The fetch logic above passes true for this, and the timeout logic passes false. If true, it will retry the original request (which presumably will fail) and return that result to the user. This handles a should-never-happen scenario where the cache doesn't contain the bootstrap page.

The above two snippets of code are then wrapped by a function, providing the event, resolve, reject, and cache variables, as well as declaring and initializing the timeoutId variable:

    // Return a bootstrap.html page within 0.5 seconds.  If the network responds
    // in time, go with that response, otherwise respond with a cached version.
    function bootstrap(event, request) {
      return new Promise(function(resolve, reject) {
        var timeoutId = null;

        caches.open("board/agenda").then(function(cache) {
          ...
        })
      })
    }

Next, we need to implement the preload function:

    // look for css and js files in the HTML response and ensure that each is cached
    function preload(cache, base, text, toolate) {
      var pattern = /"[-.\w+/]+\.(css|js)\?\d+"/g;
      var count = 0;
      var changed = false;
      var match;

      while (match = pattern.exec(text)) {
        count++;
        var path = match[0].split("\"")[1];
        // let (not var) so that each callback below sees its own request
        let request = new Request(new URL(path, base));

        cache.match(request).then(function(response) {
          if (response) {
            count--
          } else {
            fetch(request).then(function(response) {
              if (response.ok) cacheReplace(cache, request, response);
              count--;
              if (count == 0 && toolate) {
                // late response with changed resources: ask clients to reload
                clients.matchAll().then(function(clients) {
                  clients.forEach(function(client) {
                    client.postMessage({type: "reload"})
                  })
                })
              }
            })
          }
        })
      }
    };

This code parses the HTML response, looking for .css and .js files, based on knowledge of how this particular server formats the HTML. For each such entry in the HTML, the cache is searched for a match. If one is found, nothing more needs to be done. Otherwise, the resource is fetched and placed in the cache.

Once all requests are processed, and if this involved requesting a response from the network, then a check is made to see if this was a late response, and if so, a reload request is sent to all client windows.
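The parsing step can be tried on its own, outside of a service worker. The sketch below reuses the regular expression from preload; the function name "extractAssetPaths" and the sample HTML are invented for illustration.

```javascript
// Return the paths of all cache-busted .css and .js references in an HTML page.
function extractAssetPaths(html) {
  var pattern = /"[-.\w+/]+\.(css|js)\?\d+"/g;
  var paths = [];
  var match;

  while ((match = pattern.exec(html))) {
    // match[0] includes the surrounding quotes; keep only what is between them
    paths.push(match[0].split("\"")[1]);
  }

  return paths;
}

var sample = '<link rel="stylesheet" href="stylesheets/app.css?1512576000">' +
  '<script src="app.js?1512576001"></script><img src="logo.png">';
var found = extractAssetPaths(sample);
// found is ["stylesheets/app.css?1512576000", "app.js?1512576001"]
```

Note that only references carrying a numeric query string are picked up, so the image in the sample is ignored.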

cacheReplace is another application specific function:

    // insert or replace a response into the cache.  Delete other responses
    // with the same path (ignoring the query string).
    function cacheReplace(cache, request, response) {
      var path = request.url.split("?")[0];

      cache.keys().then(function(keys) {
        keys.forEach(function(key) {
          if (key.url.split("?")[0] == path && key.url != path) {
            cache.delete(key).then(function() {})
          }
        })
      });

      cache.put(request, response)
    };

The purpose of this method is as stated: to delete from the cache other responses that differ only in the query string. It also adds the response to the cache.
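The selection of entries to delete can be illustrated with plain strings. In this sketch "staleUrls" is a hypothetical name, the URLs are made up, and, as a small variation on the function above, the entry that exactly matches the current URL is kept rather than deleted and re-added.

```javascript
// Given the URLs currently in the cache, return those that refer to the same
// path as currentUrl but carry a different query string.
function staleUrls(cachedUrls, currentUrl) {
  var path = currentUrl.split("?")[0];

  return cachedUrls.filter(function(url) {
    return url.split("?")[0] == path && url != currentUrl;
  });
}

var stale = staleUrls([
  "https://example.org/app.js?1512576000",
  "https://example.org/app.js?1512579999",
  "https://example.org/app.css?1512576000"
], "https://example.org/app.js?1512579999");
// stale is ["https://example.org/app.js?1512576000"]
```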

The remainder is either straightforward or application specific in a way that has no performance relevance. The scripts and stylesheets are served with a cache falling back to network strategy. The initial preloading which normally could be as simple as a call to cache.addAll needs to be aware of query strings and for this application it turns out that a different bootstrap HTML file is needed for each meeting.

Finally, here is the client side logic which handles reload messages from the service worker:

    navigator.serviceWorker.register(scope + "sw.js", scope).then(function() {
      // watch for reload requests from the service worker
      navigator.serviceWorker.addEventListener("message", function(event) {
        if (event.data.type == "reload") {
          // ignore reload request if any input or textarea element is visible
          var inputs = document.querySelectorAll("input, textarea");

          if (Math.max.apply(
            Math,
            Array.from(inputs).map(function(element) {
              return element.offsetWidth
            })
          ) <= 0) window.location.reload()
        }
      });
    })

This code watches for type: "reload" messages from the service worker and invokes window.location.reload() only if there are no input or text area elements visible, which is determined using the offsetWidth property of each element. Very few board agenda pages have visible input fields by default; many, however, have bootstrap modal dialog boxes containing forms.
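The visibility test can be exercised without a browser by passing in plain objects that have an offsetWidth property. The name "noneVisible" is made up for this sketch; it assumes, as the text above does, that a hidden element reports an offsetWidth of zero.

```javascript
// True when no element in the list has a positive offsetWidth.
function noneVisible(elements) {
  var widths = Array.from(elements).map(function(element) {
    return element.offsetWidth;
  });

  // Math.max with no arguments is -Infinity, so an empty list counts as
  // "nothing visible" and a reload would be allowed
  return Math.max.apply(Math, widths) <= 0;
}
```

For example, noneVisible([]) and noneVisible([{offsetWidth: 0}]) are both true, while a list containing {offsetWidth: 120} yields false and would suppress the reload.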

Performance Results

In production when using a browser that supports Service Workers, requests for the bootstrap page now typically range from 100 to 300 milliseconds, with the resulting page fully loaded in 400 to 600 milliseconds. Generally, this includes the time it takes to fetch and render updated data, but in rare cases that may take up to an additional 200 milliseconds.

In development, and in production when there are no server processes available and when accessed using a browser that supports Service Workers, the page initially loads in 700 to 1200 milliseconds. It is not clear to me why this sees a greater range of response times; but in any case, this is still a notable improvement. Often in development, and in rare cases in production, there may be a noticeable refresh that occurs one to five seconds later.

Visitations by browsers that do not support service workers, and for that matter the first time a new user visits the board agenda tool, do not see any performance improvement or degradation with these changes.

Not a bad result from less than 100 lines of code.

Justin Mason: Links for 2017-12-05

Tue, 2017-12-05 18:58

    Quite unusual to see an honest review of travelling coach-class on an internal US flight. This is a massive stinker: “I admit American isn’t my favourite airline, but this has made me seriously re-evaluate ever travelling on them again. And it won’t be economy. If this is Americans idea of their future standards, they can keep it. Aviation enthusiasts might find it really interesting- I felt like I was in a flying prison”.

    (tags: coach travel aa airlines 737 boeing reviews comfort)

  • Using AWS Batch to Generate Mapzen Terrain Tiles · Mapzen

    Using this setup on AWS Batch, we are able to generate more than 3.75 million tiles per minute and render the entire world in less than a week! These pre-rendered tiles get stored in S3 and are ready to use by anyone through the AWS Public Dataset or through Mapzen’s Terrain Tiles API.

    (tags: mapzen mapping tiles batch aws s3 lambda docker)

  • Theresa May’s Blue Monday — Fintan O’Toole

    Having backed down, May was then peremptorily informed that she was not even allowed to back down. She left her lunch with the president of the European Commission, Jean-Claude Juncker, to take a phone call from the DUP’s Arlene Foster, who told her that the deal she had just made was unacceptable. May then had to go back in and tell Juncker that she could not agree to what she had just agreed to. It is a scarcely credible position for a once great state to find itself in: its leader does not even have the power to conduct a dignified retreat.

    (tags: eu ireland brexit uk theresa-may dup politics ec fintan-otoole)

  • Handling GDPR: How to make Kafka Forget

    How do you delete (or redact) data from Kafka? The simplest way to remove messages from Kafka is to simply let them expire. By default Kafka will keep data for two weeks and you can tune this as required. There is also an Admin API that lets you delete messages explicitly if they are older than some specified time or offset. But what if we are keeping data in the log for a longer period of time, say for Event Sourcing use cases or as a source of truth? For this you can make use of  Compacted Topics, which allow messages to be explicitly deleted or replaced by key. Similar applies to Kinesis I would think.

    (tags: kafka kinesis gdpr expiry deleting data privacy)

Ortwin Glück: [Code] Gentoo python-3.5 update woes

Tue, 2017-12-05 05:37
Your latest emerge -uavD @world may result in the following error:
    # required by sys-apps/portage-2.3.13-r1::gentoo[python_targets_python3_4,-build,python_targets_python2_7,-python_targets_python3_5]
    # required by virtual/package-manager-0::gentoo
    # required by @system
    # required by @world (argument)
    >=dev-python/pyblake2-0.9.3-r1 python_targets_python3_4
That's because python-3.5 as well as python-3.6 and a new version of portage went stable, which in turn causes python USE flag changes, and portage can't figure out correctly what to do.

Solve this by manually installing python-3.5 (only) first and recompiling the resulting USE flag changes:
    emerge -1av python:3.5
    eselect python update
    emerge -1avD --changed-use @world
That should let you update world again and depclean will remove python-3.4 for you.
    emerge -uavD @world
    emerge --depclean

Aaron Morton: TLP Dashboards for Datadog users, out of the box.

Mon, 2017-12-04 19:00

We had the pleasure of releasing our monitoring dashboards designed for Apache Cassandra on Datadog last week. It is a nice occasion to share our thoughts on Cassandra dashboard design, as it is a recurring question in the community.

We wrote a post about this on the Datadog website here.

For people using Datadog we hope this will give more details on how the dashboards were designed, thus on how to use the dashboards we provided. For others, we hope this information will be useful in the process of building and then using your own dashboards, with the technology of your choice.

The Project

Building an efficient, complete, and readable set of dashboards to monitor Apache Cassandra is time consuming and far from being straightforward.

Those who have tried it probably noticed that it requires a fair amount of time and knowledge of both the monitoring technology in use (Datadog, Grafana, Graphite or InfluxDB, metrics-reporter, etc.) and Apache Cassandra. Creating dashboards is about picking the most relevant metrics, aggregations, units and chart types, and then gathering them in such a way that this huge amount of data actually provides usable information. Dashboards need to be readable, understandable and easy to use for the final operator.

On one hand, creating comprehensive dashboards is a long and complex task. On the other hand, every Apache Cassandra cluster can be monitored roughly the same way. Most production issues can be detected and analyzed using a common set of charts, organized the same way, for all the Apache Cassandra clusters. Each cluster may require additional operator specific dashboards or charts depending on workload and merging of metrics outside of Cassandra, but those would supplement the standard dashboards, not replace them. There are some differences depending on the Apache Cassandra versions in use, but they are relatively minor and not subject to rapid change.

In my monitoring presentation at the 2016 Cassandra Summit I announced that we were working on this project.

In December 2017 it was released for Datadog users. If you want to get started with these dashboards and you are using Datadog, see the documentation on the Datadog integration for Cassandra for how to do this.

Dashboard Design

Our Approach to Monitoring

The dashboards have been designed to allow the operator to do the following:

  1. Easily detect any anomaly (Overview Dashboard)
  2. Be able to efficiently troubleshoot and fix the anomaly (Themed Dashboards)
  3. Find the bottlenecks and optimize the overall performance (Themed Dashboards)

The latter two points above can be seen as the same kind of operation, which can be supported by the same set of dashboards.

Empowering the operator

We strongly believe that showing the metrics to the operator can be a nice entry point for learning about Cassandra. Each of the themed dashboards monitors a distinct internal process of Cassandra. Most of the metrics related to this internal process are then grouped together within a dashboard. We think this makes it easier for the operator to understand Cassandra’s internal processes.

To make it clearer, let’s consider the example of someone completely new to Cassandra. On first repair, the operator starts an incremental repair without knowing anything about it and latencies increase substantially after a short while. Classic.

The operator would notice increased read latency in the ‘Overview Dashboard’, then head to the ‘Read Path Dashboard’. There the operator would be able to notice that the number of SSTables went from 50 to 800 on each node, or for a table. If the chart is there out of the box, then even without knowing what an SSTable is, the operator can understand that something changed there and that it relates to the outage somehow. The operator would then search in the right direction, probably solving the issue quickly, and possibly learning in the process.

What to Monitor: Dashboards and Charts Detail

Here we focus on the details of the charts and on how to use each chart efficiently. While this post discusses the dashboards available for Datadog, the metrics can be visualized using any tool, and we believe this would be a good starting point when setting up monitoring for Cassandra.

In the graphs, the values and percentiles chosen are sometimes quite arbitrary and often depend on the use case or Cassandra setup. The point is to give a reference, a starting point on what could be ‘normal’ or ‘wrong’ values. The Apache Cassandra monitoring documentation, the mailing list archive, or the #cassandra channel on freenode (IRC) are good places to get answers to questions that might pop up while using the dashboards.

Some charts are intentionally duplicated across dashboards or within a dashboard, but with a distinct visualisation or aggregation.

Detect anomalies: Overview Dashboard

We don’t try to troubleshoot at this stage. We want to detect outages that might impact the service, or check that the Cassandra cluster is globally healthy. To accomplish this, the Overview Dashboard aims to be both complete and minimalist.

Complete, as we want to be warned any time “something is happening” in the Cassandra cluster. Minimalist, because we don’t want to miss an important piece of information in a flood of non-critical or too low-level information. These charts aim to answer the question: “Is Cassandra healthy?”.

Troubleshoot issues and optimize Apache Cassandra: Themed dashboards

The goal here is to divide the information into smaller, more meaningful chunks. An issue will often affect only one of Cassandra’s subsystems, so the operator can have all the needed information in one place when working on a specific issue, without information that is irrelevant to that issue hiding more important information.

For this reason these dashboards must maximize the information on a specific theme or internal process of Cassandra and show all the low-level information (per table, per host). We often repeat charts from other dashboards, so that as Cassandra users we always find the information we need. This is the opposite of the Overview Dashboard described above, which shows only “high level” information.

Read Path Dashboard

In this dashboard we are concerned with any element that could impact a high-level client read. In fact, we want to know about everything that could affect the read path in Cassandra just by looking at this dashboard.

Write Path Dashboard

This dashboard gives a comprehensive view of the various metrics that affect write latency and throughput. Long garbage collection pause times will always result in dips in throughput and spikes in latency, so garbage collection is featured prominently on this dashboard.

SSTable management Dashboard

This dashboard gives a comprehensive view of the various metrics that impact the asynchronous steps the data goes through after a write, from the flush to the data deletion, with all the compaction processes in between. Here we want to keep track of how disk space evolves and make sure the asynchronous management of SSTables is happening efficiently, or at least as expected.

Alerting, Automated Anomaly Detection.

To conclude, once you are happy with the monitoring dashboards, it is a good idea to add some alerting rules.

It is important to detect all the anomalies as quickly as possible. To bring monitoring to the next level of efficiency, it is good to be warned automatically when something goes wrong.

We believe adding alerts on each of the “Overview Dashboard” metrics will be sufficient to detect most issues and any major outage, or will at least be a good starting point. For each metric, the alerting threshold should be high enough not to trigger false alerts, yet low enough that a mitigating action can still be taken. Some alerts should use absolute values (disk space available, CPU, etc.), while others will require relative values. Some alerts, such as those on latencies, will need manual tuning based on configuration and workload.
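
The difference between absolute and relative alerts can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the sample readings and every threshold below are hypothetical, and real rules would normally be defined in the monitoring tool itself rather than in application code.

```python
def absolute_alert(value, threshold):
    """Fire when a metric crosses a fixed limit (e.g. disk usage in percent)."""
    return value >= threshold


def relative_alert(value, baseline, max_ratio):
    """Fire when a metric grows beyond `max_ratio` times its usual level."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return value / baseline >= max_ratio


# Hypothetical readings and thresholds, to be tuned per cluster.
disk_used_pct = 85.0
read_latency_ms = 42.0
usual_read_latency_ms = 12.0

alerts = {
    "disk_space": absolute_alert(disk_used_pct, threshold=80.0),
    "read_latency": relative_alert(
        read_latency_ms, usual_read_latency_ms, max_ratio=3.0
    ),
}
print(alerts)
```

The absolute rule suits resources with a hard ceiling, while the relative rule suits metrics such as latency, whose "normal" level differs from one cluster and workload to the next.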

The biggest risk with alerting is probably being flooded by false alerts, as the natural inclination is then to start ignoring them, which leads to missing valid ones. As a general guideline, any alert should trigger an action. If it does not, the alert is relatively useless and only adds noise.


Justin Mason: Links for 2017-12-04

Mon, 2017-12-04 18:58
  • Bella Caledonia: A Wake-Up Call

    Swathes of the British elite appeared ignorant of much of Irish history and the country’s present reality. They seemed to have missed that Ireland’s economic dependence on exports to its neighbour came speedily to an end after both joined the European Economic Community in 1973. They seemed unacquainted with Ireland’s modern reality as a confident, wealthy, and internationally-oriented nation with overwhelming popular support for EU membership. Repeated descriptions of the border as a “surprise” obstacle to talks betrayed that Britain had apparently not listened, or had dismissed, the Irish government’s insistence in tandem with the rest of the EU since April that no Brexit deal could be agreed that would harden the border between Ireland and Northern Ireland. The British government failed to listen to Ireland throughout history, and it was failing to listen still.

    (tags: europe ireland brexit uk ukip eu northern-ireland border history)

  • AWS re:invent 2017: Advanced Design Patterns for Amazon DynamoDB (DAT403) – YouTube

    Video of one of the more interesting sessions from this year’s re:Invent.

    (tags: reinvent aws dynamodb videos tutorials coding)

  • AWS re:invent 2017: Container Networking Deep Dive with Amazon ECS (CON401) // Practical Applications

    Another re:Invent highlight to watch — ECS’ new native container networking model explained

    (tags: reinvent aws containers docker ecs networking sdn ops)

  • VLC in European Parliament’s bug bounty program

    This was not something I expected:

    The European Parliament has approved budget to improve the EU’s IT infrastructure by extending the free software security audit programme (FOSSA) and by including a bug bounty approach in the programme. The Commission intends to conduct a small-scale “bug bounty” activity on open-source software with companies already operating in the market. The scope of this action is to: Run a small-scale “bug bounty” activity for open source software project or library for a period of up to two months maximum; The purpose of the procedure is to provide the European institutions with open source software projects or libraries that have been properly screened for potential vulnerabilities; The process must be fully open to all potential bug hunters, while staying in-line with the existing Terms of Service of the bug bounty platform.

    (tags: vlc bug-bounties security europe europarl eu ep bugs oss video open-source)
