Network Maintenance – Impacting VSO Hosted Build Service – 8/23 00:00 UTC – 8/24 12:00 UTC.

This is a proactive notification to let you know that our partners in Azure are performing network maintenance operations in multiple Microsoft Azure regions and data centers. VSO Hosted Build customers are expected to be impacted by this maintenance.

Maintenance Window – UTC/GMT 8/23 00:00 to 8/24 12:00.

For more details on the Azure network maintenance, please refer to URL

We appreciate your patience during the maintenance. 


VS Online Service Delivery Team

Prism for WPF License Announcement

Prism for WPF and its associated projects now use the Apache 2.0 license.

The latest source code can be found on the following CodePlex sites:

  • Prism.Composition
  • Prism.Interactivity
  • Prism.UnityExtensions
  • Prism.MefExtensions
  • Prism.Mvvm
  • Prism.PubSubEvents

Prism.Mvvm and Prism.PubSubEvents are portable class libraries that now target Windows Phone 8.1.

Prism.Mvvm targets:

  • .NET Framework 4.5
  • Windows 8
  • Windows Phone 8.1
  • Windows Phone Silverlight 8

Prism.PubSubEvents targets:

  • .NET Framework 4
  • Silverlight 5
  • Windows 8
  • Windows Phone 8.1
  • Windows Phone Silverlight 8

More to try… ASP.NET vNext, .NET Native, RyuJIT, and developing ASP.NET vNext apps on Mac!

Earlier I covered .NET Framework vNext. In addition to that, you can try out the ASP.NET vNext, .NET Native and RyuJIT releases by installing Visual Studio 14 CTP3. ASP.NET vNext: ASP.NET vNext is the new version of ASP.NET for web sites and services. We’ve continued to add new features and improve the development experience for ASP.NET vNext apps in Visual Studio “14”. It’s useful to recap what ASP.NET vNext offers, and why you should choose it for your next web platform. For more info…(read more)

Synthetic Transaction Test Monitor


A while back when I was working with the legendary Jon Almquist, I picked up this little gem from him. I don’t know where he got it from. He probably just whipped it up in a matter of minutes, knowing him. I thought I might post it here as it has proved to be very useful and others may find the same.


You simply copy/paste the following line into a command prompt:

EventCreate /T ERROR /ID 101 /L APPLICATION /SO TEST /D "This is a synthetic transaction test only. Disregard this event. Note: This alert was generated by running this EventCreate command in a command prompt on the server in question."




Back from blogcation

Well, it’s been a while, but I’m back from my extended blogcation.  Since my job has shifted again, I’m going to refocus this blog back to the development side of things going forward.  Today I made a cool discovery about OperationContextScopes that might save another developer some time, so I’m going to get started with something I hope will be useful.

Imagine you are building a client application which uses WCF, and you’d like to add some headers before you invoke the web method on a proxy.  The code might look something like:

OperationContextScope scope = new OperationContextScope(myProxy.InnerChannel);

MessageHeader<string> myHeader = new MessageHeader<string>("My header value");

MessageHeader headerWrapper = myHeader.GetUntypedHeader("MyHeaderName", "");

OperationContext.Current.OutgoingMessageHeaders.Add(headerWrapper);



Of course you’d want to package this up in some way so that this particular code can be reused for all your web method calls, but that is a different blog post.  Anyway, if you write your code this way, you might be surprised to learn that you are leaking OperationContextScope objects.  You can tell this is happening because, looking in Task Manager, you can see the memory consumption of your application grow and grow as more web method calls are executed.  You can find out what is leaking by using an excellent tool called CLRProfiler.  Run your application for long enough, and if the application repeatedly calls web methods to refresh its data model over time, you’ll see when you dump the heap using CLRProfiler that although the variable scope in the above example goes out of scope, its storage is never reclaimed by the GC as you might expect.  As it turns out, WCF maintains a stack of OperationContextScope instances.  Creating a new OperationContextScope instance pushes onto that stack, and disposing of the topmost OperationContextScope instance pops the stack.  If you never dispose of the OperationContextScope instance, the stack just keeps growing.  Because the stack is reachable, the GC isn’t allowed to collect anything on the stack.  Thus the leak.  Fixing the above code is super simple once you’ve figured out the problem:


using (OperationContextScope scope = new OperationContextScope(myProxy.InnerChannel))
{
    MessageHeader<string> myHeader = new MessageHeader<string>("My header value");

    MessageHeader headerWrapper = myHeader.GetUntypedHeader("MyHeaderName", "");

    OperationContext.Current.OutgoingMessageHeaders.Add(headerWrapper);
}




There are a lot of other things you need to be careful about, like carefully closing the proxy, but that is also for another blog post.  Meantime, happy coding!


Fun with the Interns: Zach Montoya Builds a Visual Studio Designer for .NET Native

A few weeks ago when I was up in Redmond I had the pleasure of interviewing some interns on the .NET team to talk about their experience as an intern at Microsoft and to show off the projects they are working on. In this interview I sit down with Zach Montoya, a Developer Intern on the .NET Runtime team, and we chat about his internship experience and summer project.

Zach built a Visual Studio extension that allows maintaining and configuring runtime directives for .NET Native right from Visual Studio. .NET Native compiles .NET code to native machine code. Runtime directives are used to provide additional information to the .NET Native tool chain that tell the compiler what APIs you intend to call dynamically so those APIs can also be included.
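For illustration (this is a generic sample, not Zach's extension – the assembly and type names below are placeholders), runtime directives live in an rd.xml file in the project; a minimal one that preserves a single type for dynamic use might look like:

```xml
<!-- Default.rd.xml: tells the .NET Native tool chain which types the app
     uses dynamically (e.g. via reflection) so their metadata and code
     survive compilation to native machine code. -->
<Directives xmlns="http://schemas.microsoft.com/netfx/2013/01/metadata">
  <Application>
    <!-- "MyApp" and the type name are hypothetical placeholders. -->
    <Assembly Name="MyApp">
      <Type Name="MyApp.ViewModels.MainViewModel" Dynamic="Required All" />
    </Assembly>
  </Application>
</Directives>
```

Maintaining entries like this by hand is exactly the chore a designer inside Visual Studio can take over.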

Watch: Fun with the Interns: Zach Montoya Builds a Visual Studio Designer for .NET Native

And for all those students out there pursuing a career in computer science, you should consider an internship at Microsoft. You can help build real software that helps millions of people! Learn more about the Microsoft internship program here.


Retrospective on the Aug 14th VS Online outage

We had a pretty serious outage last Thursday; all told, it was a little over 5 hours.  The symptoms were that performance was so bad that the service was basically unavailable for most people (though there was some intermittent access as various mitigation steps were taken).  It started around 14:00 UTC and ended a little before 19:30 UTC.  This duration and severity makes this one of the worst incidents we’ve ever had on VS Online.

We feel terrible about it and continue to be committed to doing everything we can to prevent outages.  I’m sorry for the problems it caused.  The team worked tirelessly from Thursday through Sunday both to address the immediate health issues and to fix underlying bugs that might cause recurrences. 

As you might imagine, for the past week, we’ve been hard at work trying to understand what happened and what changes we have to make to prevent such things in the future.  It is often very difficult to find proof of the exact trigger for outages but you can learn a ton by studying them closely.

On an outage like this, there’s a set of questions I always ask, and they include:

What happened?

What happened was that one of the core SPS (Shared Platform Services) databases became overwhelmed with database updates and started queuing up so badly that it effectively blocked callers.  Since SPS is part of the authentication and licensing process, we can’t just completely ignore it – though I would suggest that if it became very sluggish, it wouldn’t be the end of the world if we bypassed some licensing checks to keep the service responsive.

What was the trigger?  What made it happen today vs yesterday or any other day?

Though we’ve worked hard on this question, we don’t have any definitive answer (we’re still pursuing it though).  We know that before the incident, some configuration changes were made that caused a significant increase in traffic between our “TFS” service and our “SPS” (Shared Platform Service).  That traffic involved additional license validation checks that had been improperly disabled.  We also know that, at about the same time, we saw a spike in latencies and failed deliveries of Service Bus messages.  We believe that one or both of these were key triggers, but we are missing some logging on SPS database access to be 100% certain.  Hopefully, in the next few days, we’ll know more conclusively.

What was the “root cause”?

This is different from the trigger in the sense that the trigger is often a condition that causes some cascading effect.  The root cause is more about understanding why the effect cascaded and why it took the system down.  I believe the root cause was that we had accumulated a series of bugs that were causing extra SPS database work to be done, and that the system was inherently unstable – from a performance perspective.  It just took a poke at the system – in the form of extra identity or licensing churn – to cause a ripple effect on these bugs.  Most, but not all, of them were introduced in the last few sprints.  Here’s a list of the “core” causal bugs that we’ve found and fixed so far:

  1. Many calls from TFS -> SPS were inappropriately updating the “TFS service” identity’s properties. This created SQL write contention and invalidated the identity by sending a Service Bus message from SPS -> TFS. This message caused the app tiers to invalidate their cache and subsequent TFS requests to make a call to SPS causing further property updates and a vicious cycle.
  2. A bug in 401-handling code was making an update to the identity causing an invalidation of the identity’s cache – no vicious cycle but lots of extra cache flushes.
  3. A bug in the Azure Portal extension service was retrying 401s every 5sec.
  4. An old behavior that was causing the same invalidation ‘event’ to be resent from each SPS AT (user1 was invalidated on AT1, user2 was invalidated from AT2 -> user1 will be sent 2 invalidations).  And we have about 4 ATs so this can have a pretty nasty multiplicative effect.

We’ve also found/fixed a few “aggravating” bugs that made the situation worse but wouldn’t have been bad enough to cause serious issues on their own:

  1. Many volatile properties were being stored in Identity’s extended properties causing repeated cache invalidations and broad “change notifications” to be sent to listeners who didn’t care about the property changes.
  2. A few places were updating properties with unchanged values causing an unnecessary invalidation and SQL round trips.

All of these, in some form, have to do with updates to identities in the system that then cause propagating change notifications (which in some cases were over-propagated), resulting in extra processing, updates and cache invalidations.  The system was “unstable” because anything that caused an unexpected increase in load on these identity updates would spiral out of control due to multiplicative effects and cycles.
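As a rough, hypothetical illustration of why this kind of fan-out compounds (the model and numbers are made up for the sketch, not taken from VSO telemetry):

```python
# Back-of-the-envelope sketch: if every invalidation event is re-sent by
# each of N application tiers (ATs), one logical invalidation becomes N
# messages, and if each message in turn triggers a follow-on update that
# is itself invalidated, the load multiplies each round.

def messages_sent(invalidations, num_ats, rounds):
    """Count total messages when each AT re-sends every invalidation and
    each round of messages triggers one follow-on invalidation per message."""
    total = 0
    current = invalidations
    for _ in range(rounds):
        sent = current * num_ats   # every AT re-sends each event
        total += sent
        current = sent             # each message triggers another update
    return total

# One invalidation, 4 ATs, 3 rounds of ripple:
# round 1: 4 messages; round 2: 16; round 3: 64 -> 84 total.
print(messages_sent(1, 4, 3))  # → 84
```

With a single AT the same three rounds would send only 3 messages, which is why deduplicating the per-AT re-sends was one of the fixes.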

What did we learn from the event?

I always want to look beyond the immediate and understand the underlying pattern.  This is sometimes called “The 5 whys”.  This is, in fact, the most important question in the list.  Why did this happen and what can we do differently?  Not what bugs did we hit.  Why were those bugs there?  What should we have done to ensure those bugs were caught in the design/development process before anything went into production?

Let me start with a story.  Way back in 2008, when we were beginning to roll out TFS across very large teams at Microsoft, we had a catastrophe.  We significantly underestimated the load that many thousands of people and very large scale build labs would put on TFS.  We lived in hell for close to 9 months with significant performance issues, painful daily slowdowns and lots of people sending me hate mail.

My biggest learning from that was that, when it comes to performance, you can’t trust abstractions.  In that case, we were treating SQL Server as a relational database.  What I learned is that it’s really not.  It’s a software abstraction layer over disk I/O.  If you don’t know what’s happening at the disk I/O layer, you don’t know anything.  Your ignorance may be bliss – but when you get hit with a 10x or 100x scale/performance requirement, you fall over dead.  We went very deep into SQL disk layout, head seeks, data density, query plans, etc.  We optimized the flows from the top to the very bottom and made sure we knew where all the CPU went, where all the I/Os went, etc.  When we were done, TFS scaled to crazy large teams and code bases.

We then put in place regression tests that would measure changes, not just in time but also in terms of SQL round trips, etc.
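A sketch of what such a regression test can look like (the database wrapper, operation and baseline below are illustrative, not the actual TFS test infrastructure):

```python
# Hypothetical sketch: a regression test that fails when an operation's
# resource cost (here, SQL round trips) exceeds a recorded baseline,
# so a dev finds out at check-in time rather than in production.

class CountingDatabase:
    """Stand-in data layer that counts round trips instead of querying SQL."""
    def __init__(self):
        self.round_trips = 0

    def query(self, sql):
        self.round_trips += 1
        return []  # stand-in for real results

def get_identity(db, identity_id):
    # Illustrative operation under test: costs two queries per call.
    db.query("SELECT * FROM Identities WHERE Id = ?")
    db.query("SELECT * FROM IdentityProperties WHERE Id = ?")

BASELINE_ROUND_TRIPS = 2  # recorded when the test was written

def test_get_identity_round_trips():
    db = CountingDatabase()
    get_identity(db, 42)
    assert db.round_trips <= BASELINE_ROUND_TRIPS, (
        f"get_identity now costs {db.round_trips} round trips; "
        f"baseline is {BASELINE_ROUND_TRIPS}"
    )
```

If a later change adds a third query to `get_identity`, the test fails and the extra round trip has to be justified or removed, rather than silently shipping.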

So back to last Thursday…  We’ve gotten sloppy.  Sloppy is probably too harsh.  As with any team, we are pulled in the tension between eating our Wheaties and adding capabilities that customers are asking for.  In the drive toward rapid cadence, value every sprint, etc., we’ve allowed some of the engineering rigor that we had put in place back then to atrophy – or more precisely, not carried it forward to new code that we’ve been writing.  This, I believe, is the root cause – Developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations.  You must, of course, be able to do this in synthetic test environments – like unit tests, but also in production environments because you’ll never catch everything in your tests.

So, we’ve got some bugs to fix and some more debt to pay off in terms of tuning the interaction between TFS and SPS but, most importantly, we need to put in place some infrastructure to better measure and flag changes in end to end cost – both in test and in production.

The irony here (not funny irony but sad irony), is that there has been some renewed attention on this in the team recently.  A few weeks ago, we had a “hack-a-thon” for small groups of people on the team to experiment with new ideas.  One of the teams built a prototype of a solution for capturing important performance tracing information across the end-to-end thread of a request.  I’ll try to do a blog post in the next couple of weeks to show some of these ideas.  And just the week before this incident Buck (our Dev director) and I were having a conversation about needing to invest more in this very scenario.  Unfortunately we had a major incident before we could address the gap.

What are we going to do?

OK, so we learned a lot, but what are we actually going to do about it?  Clearly step 1 is to mitigate the emergency and get the system back to sustainable health quickly.  I think we are there now.  But we haven’t addressed the underlying whys yet.  So, some plans we are making now include:

  1. We will analyze call patterns within SPS and between SPS and SQL and build the right telemetry and alerts to catch situations early.  Adding baselines to unit and functional tests will ensure that baselines don’t get violated when a dev checks in code.
  2. Partitioning and scaling of SPS Config DB will be a very high priority. With the work to enable tenant-level hosts, we can partition identity related information per tenant.  This enables us to scale SPS data across databases, enabling a higher “ceiling” and more isolation in the event things ever go badly again.
  3. We are looking into building an ability for a service to throttle and recover itself from a slow or failed dependency. We should leverage the same techniques for TFS -> SPS communication and let TFS leverage cached state or fail gracefully.  (This is actually also a remaining action item from the previous outage we had a month or so ago.)
  4. We should test our designs for lag in Service Bus delivery and ensure that our functionality continues to work or degrades gracefully.
  5. Moving to Service Bus batching APIs and partitioned topics will help us scale better to handle very ‘hot’ topics like Identity.
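Item 3 describes what is commonly called a circuit breaker. A minimal sketch of the idea (illustrative only, not the actual VSO implementation):

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures, calls are short-circuited
    for reset_after seconds, during which the caller falls back to cached
    state instead of waiting on a sick dependency."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # fail fast: use cached state
            self.opened_at = None      # half-open: probe the dependency again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0              # success resets the failure count
        return result
```

Wrapping TFS -> SPS calls in something like this means a slow SPS degrades TFS to cached licensing/identity state instead of blocking every request behind a queued-up dependency.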

As always, hopefully you get some value from the details behind the mistakes we make.  Thank you for bearing with us.


Developers, .Net Framework vNext 4.5.2 and its Support Life Cycle

.NET Framework vNext 4.5.2: Now you can download the .NET Framework 4.5.2 (also known as .NET Framework vNext), a highly compatible, in-place update to the .NET 4.x family (.NET 4, 4.5, and 4.5.1). It gives you the benefits of greater stability, reliability, security and performance without any action beyond installing the .NET 4.5.2 update, i.e., there is no need to recompile your application to get these benefits. What’s there for Developers? – Check out the official…(read more)

Charting new territory at APC

Just over a week to go now until we kick off Australian Partner Conference 2014. We’ve revealed most of the agenda on the website and I hope you’ll agree there are some great speakers and sessions lined up! I’m pleased to make an addition to the agenda today in partnership with CRN – “Charting New Territory” is the topic for our industry partner panel, which we’ll be hosting on Wednesday afternoon just before the networking drinks. Hopefully plenty to debate, and it should provide some great fuel for further discussion into the evening.

See you on the Gold Coast



Charting new territory - The roadmap that got us here might not get us where we need to go

In a discussion chaired by CRN magazine, leading partners reveal how they are evolving their businesses for the era of cloud, mobile and more. Hear from some of the best-known names in the industry on their perspectives on the changes surrounding us and how their businesses are changing to capture new opportunities.




Dimension Data

Brian Walshe, National General Manager, End User Computing

Brian has worked for global solution provider Dimension Data for more than a decade. Dimension Data has over 15 years of experience deploying on-premises Exchange, SharePoint and Lync solutions for clients in Australia and has migrated more than one million end users to public cloud business productivity solutions, with Brian overseeing many of these projects.

Data#3



Scott Gosling, National Practice Manager, Microsoft Services

Scott leads Data#3’s Microsoft solutions practice nationally, and develops their strategic initiatives, including leading their transition to cloud. He also sits on the Microsoft Australia Security Partner Advisory Council, and the Microsoft Worldwide Infrastructure & Cloud Partner Advisory Council. Scott is also a member of the Griffith University Industry Advisory Board.


Brennan IT

Dave Stevens, Managing Director

Dave founded this mid market-focused IT provider in 1997 and has led it to be one of the country’s most respected technology businesses, with more than 200 staff in offices in Sydney, Melbourne, Brisbane and Newcastle. Brennan IT’s broad array of services span cloud, data networking, phones, video and unified communications, IT security, software development, IT services & support, hardware procurement and software licensing.

Kloud



Nicki Bowers, Managing Director

Nicki runs one of the fastest-growing and most successful ‘born-in-the-cloud’ Microsoft partners in the country. Established in 2010, Kloud reached 130 staff and over $30 million of sales in less than four years of operation. Nicki’s credentials include roles at Microsoft, Dimension Data, Compaq and BHP Billiton.

Readify



Graeme Strange, Managing Director (FAICD)

Graeme has headed up this software services business since 2007. In 2013, Readify secured $26 million of investment from Blue Sky Private Equity, including $10 million for acquisitions. In the same year, Readify was named Microsoft’s Australian Country Partner of the Year, following closely on the heels of being named Microsoft’s Global Software Development Partner of the Year. The company has seen eight years of rapid and profitable growth.


Moderated by Steven Kiernan, Editor, CRN Australia

Steven heads up this leading print and online channel publication, the Australian home of the world’s biggest media brand for resellers, systems integrators, managed service providers, ISVs, distributors and vendors. Steven and his team strive to provide the best content to meet the needs of Australian businesses in the rapidly evolving IT channel.


App-V 5: Revisiting Registry Staging

Last year, I wrote a blog post on registry staging in App-V and how it can affect initial publishing times – particularly on VDI – and many people wrote to tell me it was very helpful in reducing publishing times. This was especially important for non-persistent VDI and RDS/XenApp environments.
A colleague at Microsoft recently reminded me…(read more)