
Feed aggregator

Voron internals: The transaction journal & recovery

Ayende @ Rahien - 17 hours 7 min ago


In the previous post, I talked about the usage of scratch files to enable MVCC and the challenges that this entails. In this post, I want to talk about the role the transaction journal files play in all of this. I have talked a lot about how to ensure that transaction journals are fast, what goes into them, etc. But this post is about how they are used inside Voron.

The way Voron stores data inside the transaction journal is actually quite simple. We have a transaction header, which contains quite a bit of interesting information, and then we have all the pages that were modified in this transaction, compressed.
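To make that concrete, here is a rough sketch of writing such an entry. The header fields, hash function, page size, and binary layout are all illustrative assumptions for the sketch, not Voron's actual on-disk format:

```python
# Hedged sketch of a journal entry: a small header followed by the compressed pages.
import os
import struct
import zlib

HEADER_FMT = "<QQQI"   # tx id, page count, hash of compressed payload, compressed size (all assumed)
PAGE_SIZE = 4096       # illustrative page size

def write_journal_entry(journal, tx_id, dirty_pages):
    """dirty_pages: dict of page number -> full page bytes (PAGE_SIZE each)."""
    payload = b"".join(struct.pack("<Q", page_no) + data
                       for page_no, data in sorted(dirty_pages.items()))
    compressed = zlib.compress(payload)
    header = struct.pack(HEADER_FMT, tx_id, len(dirty_pages),
                         zlib.crc32(compressed), len(compressed))
    journal.write(header + compressed)
    journal.flush()
    os.fsync(journal.fileno())  # the transaction counts as committed only once the
                                # whole entry has reached stable storage
```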


Compressing the pages can save a lot of the I/O we write. But the key aspect here is that a transaction is considered committed by Voron only when the entire entry has been written to stable storage. See the post above for a complete discussion of why this matters and how to do it quickly and with the least amount of pain.

Typically, the transaction journal is only read during recovery; during normal operation it is write-only. We let the journal files grow to about 64MB in size, then we create new ones. During database startup, we check which journal file and position we last synced (more on that later), and we start reading from there. We read the transaction header and compare its hash to the hash of the compressed data. If they match (along with a bunch of other checks we do), we consider this a valid commit, decompress the data into a temporary buffer, and now have all the dirty pages that were written in that transaction.

We can then just copy them to the appropriate locations in the data file. We continue doing so until we hit the end of the last file or we hit a transaction that is invalid or empty. At that point we stop, consider this the end of the valid committed transactions, and complete recovery.
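Put together, the recovery loop described above looks roughly like this; it is a sketch that reuses the illustrative entry layout from the earlier snippet, not Voron's real format:

```python
# Hedged sketch of journal recovery: read entries from the last synced position,
# verify each entry's hash, decompress it, and copy the pages into the data file.
import struct
import zlib

HEADER_FMT = "<QQQI"   # same illustrative layout as the write-side sketch
HEADER_SIZE = struct.calcsize(HEADER_FMT)
PAGE_SIZE = 4096

def recover(journal, data_file, last_synced_pos):
    journal.seek(last_synced_pos)
    while True:
        header = journal.read(HEADER_SIZE)
        if len(header) < HEADER_SIZE:
            break                                   # ran off the end of the journal
        tx_id, page_count, expected_hash, size = struct.unpack(HEADER_FMT, header)
        compressed = journal.read(size)
        if len(compressed) < size or zlib.crc32(compressed) != expected_hash:
            break                                   # invalid or partial transaction: stop here
        payload = zlib.decompress(compressed)
        for i in range(page_count):
            offset = i * (8 + PAGE_SIZE)
            page_no, = struct.unpack_from("<Q", payload, offset)
            data_file.seek(page_no * PAGE_SIZE)
            data_file.write(payload[offset + 8: offset + 8 + PAGE_SIZE])
            # buffered writes only; the data file is not fsynced here
```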

Note that at this point, we have written a lot of data to the data file, but we have not flushed (fsynced) it. The reason is that flushing is incredibly expensive, especially during recovery, where we might be re-playing a lot of data, so we skip it. Instead, we rely on the normal flushing process to do this for us. By default, this will happen within a minute of the database starting up, in the background, so it reduces the interruption to regular operations. This gives us a very fast startup time, and our in-memory state tells us the next place we need to flush from the log, so we don’t do the same work twice.

However, that does mean that if we fail midway through, there is absolutely no change in behavior. On the next recovery, we’ll write the same information to the same places, so replaying the journal file becomes an idempotent operation that can fail and recover without a lot of complexity.

We do need to clear the journal files at some point, and this happens after we have synced the data file. At that point, we know that the data is safely stored in the data file, and we can update our persistent state recording where recovery needs to start the next time. Once those two actions are done, we can delete the old (and now unused) journal files. Note that at each step of the operation, the failure mode is to simply retry the idempotent operation (copying the pages from the journal to the data file), so there is no need for complex recovery logic.
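A sketch of that sequence, with invented names, is below; the important part is the ordering: sync the data file, persist the new recovery starting point, and only then delete the journals:

```python
# Hedged sketch of the cleanup sequence after flushing journal data to the data file.
import os

def sync_and_cleanup(data_file, state_file, applied_up_to, old_journals):
    os.fsync(data_file.fileno())                    # 1. the data is now durable in the data file
    state_file.seek(0)
    state_file.write(repr(applied_up_to).encode())  # 2. persist where recovery starts next time
    state_file.flush()
    os.fsync(state_file.fileno())
    for path in old_journals:                       # 3. only now are the old journals truly unused
        os.remove(path)
```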

During normal operation, we’ll clear a journal file once it has been confirmed that all the data it holds was successfully flushed to the data file and that this has been recorded in stable storage. So in practice, database restarts that require recovery are typically very fast, only needing to replay the last few transactions before we are ready for business again.

Categories: Blogs

Top 10 Most Popular Articles Of Java

C-Sharpcorner - Latest Articles - 19 hours 7 min ago
We have compiled a list of the top 10 most popular Java articles.
Categories: Communities

Why Is .NET Core Important?

C-Sharpcorner - Latest Articles - 19 hours 7 min ago
Be aware of the opportunities of .NET Core.
Categories: Communities

Creating Web Based Music Player System Using MVC and AngularJs

C-Sharpcorner - Latest Articles - 19 hours 7 min ago
In this article, we will learn how to create a simple web based music player system using MVC, AngularJS and Web API.
Categories: Communities

Build Your First AngularJS 2 Application With TypeScript

C-Sharpcorner - Latest Articles - 19 hours 7 min ago
In this article, you will learn how to build your first AngularJS 2 application with TypeScript.
Categories: Communities

Power BI at SQL Saturday Portland Oct 22

This year SQL Saturday has dedicated an entire track to Power BI.  They are still deciding which sessions they will be hosting, but the initial submissions are looking great!

In addition to the Power BI sessions, CSG will also be offering a Dashboard in an Hour session!

If you are not familiar with this event, SQLSaturday is a free training event for SQL Server professionals and those wanting to learn about SQL Server. This event will be held on Oct 22, 2016 at Washington State University Vancouver, 14204 NE Salmon Creek Ave, Vancouver, Washington, 98686, United States.

For more information, check out: http://www.sqlsaturday.com/572/eventhome.aspx


 

Sessions (title and speaker):

1. Analyzing your online presence using Power BI (Asgeir Gunnarsson)
2. Analyzing SQL Server Data using PowerPivot in MS Excel (Wylie Blanchard)
3. Azure Machine Learning: From Design to Integration (Peter Myers)
4. Code Like a Pirate: Intro to R and Data Science Tools in MS (Jamey Johnston)
5. Mobile, Paginated, KPIs, and Power BI, Oh My! SSRS 2016 Reporting (Steve Wake)
6. Power BI for the Developer (Peter Myers)
7. Reports on the Run: Mobile Reporting with SQL Server 2016 (Peter Myers)
8. SQL Server R Services in SQL 2016 (Chris Hyde)
9. Calling REST APIs, working with JSON and integrating with your Web Development using Power BI (Charles Sterling)

 

Calling REST APIs, working with JSON and integrating with your Web Development using Power BI

Charles Sterling shows how to use Power BI in your development efforts: specifically, how to call REST APIs with Power BI without writing any code; how to parse, model, and transform the resulting JSON to make creating rich interactive reports a snap; and how to integrate this into your development efforts by embedding the Power BI data visualizations into your web applications.

http://aka.ms/chassbio

Categories: Companies

Power BI Dashboard in a Day and Dashboard in an Hour Training Near you

One of the top requests we hear from Power BI customers is:

“Where can I find Power BI training?”

In addition to a LOT of custom training by the Power BI Partners, the development team works on three training courses that we update monthly:

  1. Product based, Guided Learning https://powerbi.microsoft.com/en-us/guided-learning/
  2. Visualizing Data EDX Course: https://www.edx.org/course/analyzing-visualizing-data-power-bi-microsoft-dat207x-3
  3. Dashboard in a Day (DIAD) and Dashboard in an Hour Training

The Dashboard in a Day and Dashboard in an Hour material is training that we update monthly to show off new features and make available to user groups, MVPs, and of course our Power BI partners to offer their customers.

Over the next two months you can find this content at several cities near you!

1. Seattle User Group DIAH Sept 21st
2. Chicago Dashboard in a Day Sept 21st
3. Portland Dashboard in an Hour Sept 29th
4. Boston Dashboard in a Day Oct 5th
5. Atlanta Dashboard in a Day Oct 19th
6. Portland SQL Saturday Oct 22nd

 

 


Dashboard in an Hour outline:

  • Problem Statement
  • Document Structure
  • Prerequisites
  • Power BI Desktop – Get Data
  • Power BI Desktop – Manage Relationship
  • Power BI Desktop – Create Report
  • Power BI Service – Import Report
  • Power BI Service – Create Dashboard
  • Power BI Service – Power Q & A
  • Power BI Service – Share Dashboard
  • References

 


Dashboard in a Day course outline:

  • Overview
  • Introduction
  • Data Set
  • Course Outline
  • Power BI Desktop
  • Importing Data
  • Transforming your Data
  • Interactive Data Exploration
  • Power BI Service – Part I
  • Power BI Service – Creating Dashboard and uploading your Report
  • Power BI Service – Operational Dashboard and Sharing
  • Power BI Service – Refreshing data on the Dashboard
  • Power BI Service – Part II
  • Distributing content to larger audiences for them to customize
  • View and manage your Excel reports in Power BI
  • Collaboration via Office 365 Groups
  • References

Categories: Companies

Power BI training in Portland: Dashboard in an Hour by CSG Sept 29th

 

 

CSG is taking the DIAH training and making it available FOR FREE!

Additionally, the folks at CSG will be offering a Dashboard in an Hour session at SQL Saturday: http://www.sqlsaturday.com/572/eventhome.aspx

 


 

What is Dashboard in an Hour?

A hands-on session using Microsoft Power BI to show you how to build a Power BI dashboard from Excel spreadsheets, or a local/public database. It is led by Microsoft Gold Partner, CSG Pro.

Who is the session for?

Anyone interested in Microsoft Power BI who’d like to gain greater insights from their data.

When is the Dashboard in an Hour?

September 29th, 2016, 9:00 to 10:00 a.m. Coffee, carbs and set-up starts at 8:30 a.m. For those who desire, an optional Q&A and 1:1 is available with our Power BI subject matter experts from 10:00 to 10:30 a.m.

Where?

Microsoft Portland Office (in the Pearl District), 1414 NW Northrup St, Suite 900, Portland, OR 97209.

Cost and Prerequisites

There is no cost. This is a Bring Your Own Device (BYOD) session.

In case you don’t know, or want to know a little more, just what is Power BI?

Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights. Whether your data is a simple Excel spreadsheet, or a collection of cloud-based and on-premises hybrid data warehouses, Power BI lets you easily connect to your data sources, visualize (or discover) what’s important, and share that with anyone or everyone you want.

To register:

https://www.csgpro.com/events

 


Dashboard in an Hour outline:

  • Problem Statement
  • Document Structure
  • Prerequisites
  • Power BI Desktop – Get Data
  • Power BI Desktop – Manage Relationship
  • Power BI Desktop – Create Report
  • Power BI Service – Import Report
  • Power BI Service – Create Dashboard
  • Power BI Service – Power Q & A
  • Power BI Service – Share Dashboard
  • References

 

 

Of course, be sure to ask them about advanced training, like the entire Dashboard in a Day class:

 


 

Dashboard in a Day course outline:

  • Overview
  • Introduction
  • Data Set
  • Course Outline
  • Power BI Desktop
  • Importing Data
  • Transforming your Data
  • Interactive Data Exploration
  • Power BI Service – Part I
  • Power BI Service – Creating Dashboard and uploading your Report
  • Power BI Service – Operational Dashboard and Sharing
  • Power BI Service – Refreshing data on the Dashboard
  • Power BI Service – Part II
  • Distributing content to larger audiences for them to customize
  • View and manage your Excel reports in Power BI
  • Collaboration via Office 365 Groups
  • References

Categories: Companies

Eight scenarios with Apache Spark on Azure that will transform any business

This post was authored by Rimma Nehme, Technical Assistant, Data Group.

Spark-Azure

Since its birth in 2009, and the time it was open sourced in 2010, Apache Spark has grown to become one of the largest open source communities in big data, with over 400 contributors from 100 companies. Spark stands out for its ability to process large volumes of data up to 100x faster, because data is persisted in-memory. The Azure cloud makes Apache Spark incredibly easy and cost effective to deploy, with no hardware to buy and no software to configure, with a full notebook experience for authoring compelling narratives, and with integration with partner business intelligence tools. In this blog post, I am going to review some of the truly game-changing usage scenarios for Apache Spark on Azure that companies can employ in their context.

Scenario #1: Streaming data, IoT and real-time analytics

Apache Spark’s key use case is its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for companies to be able to stream and analyze it all in real time. Spark Streaming has the capability to handle this type of workload exceptionally well. As shown in the image below, a user can create an Azure Event Hub (or an Azure IoT Hub) to ingest rapidly arriving data into the cloud; both Event and IoT Hubs can intake millions of events and sensor updates per second that can then be processed in real-time by Spark.

Scenario 1_Spark Streaming

Businesses can use this scenario today for:

  • Streaming ETL: In traditional ETL (extract, transform, load) scenarios, the tools are used for batch processing: data must first be read in its entirety, converted to a database-compatible format, and then written to the target database. With streaming ETL, data is continually cleaned and aggregated before it is pushed into data stores or sent on for further analysis (see the sketch after this list).
  • Data enrichment: Streaming capability can be used to enrich live data by combining it with static or ‘stationary’ data, thus allowing businesses to conduct more complete real-time data analysis. Online advertisers use data enrichment to combine historical customer data with live customer behavior data and deliver more personalized and targeted ads in real-time and in the context of what customers are doing. Since advertising is so time-sensitive, companies have to move fast if they want to capture mindshare. Spark on Azure is one way to help achieve that.
  • Trigger event detection: Spark Streaming can allow companies to detect and respond quickly to rare or unusual behaviors (“trigger events”) that could indicate a potentially serious problem within the system. For instance, financial institutions can use triggers to detect fraudulent transactions and stop fraud in its tracks. Hospitals can also use triggers to detect potentially dangerous health changes while monitoring patient vital signs and sending automatic alerts to the right caregivers who can then take immediate and appropriate action.
  • Complex session analysis: Using Spark Streaming, events relating to live sessions, such as user activity after logging into a website or application, can be grouped together and quickly analyzed. Session information can also be used to continuously update machine learning models. Companies can then use this functionality to gain immediate insight into how users are engaging on their site and provide more real-time personalized experiences.
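To make the streaming ETL idea concrete, here is a minimal PySpark Structured Streaming sketch. It reads JSON events from a socket as a stand-in source (a real deployment on Azure would read from Event Hubs or IoT Hub through the corresponding Spark connector), cleans them against a schema, and continuously aggregates per device. The host, port, column names, and schema are illustrative assumptions:

```python
# Hedged streaming-ETL sketch with PySpark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

schema = (StructType()
          .add("deviceId", StringType())
          .add("temperature", DoubleType())
          .add("eventTime", TimestampType()))

raw = (spark.readStream
       .format("socket")            # stand-in for an Event Hubs / IoT Hub source
       .option("host", "localhost")
       .option("port", 9999)
       .load())

# Clean the raw text into typed columns, then aggregate continuously.
events = raw.select(from_json(col("value"), schema).alias("e")).select("e.*")
per_device = (events
              .withWatermark("eventTime", "10 minutes")
              .groupBy(window(col("eventTime"), "1 minute"), col("deviceId"))
              .avg("temperature"))

query = (per_device.writeStream
         .outputMode("update")
         .format("console")         # stand-in for a real sink such as a data store
         .start())
query.awaitTermination()
```

The same skeleton applies to the trigger-detection and session-analysis bullets: swap the aggregation for a filter or a sessionization step and point the sink at an alert queue or a feature store.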
Scenario #2: Visual data exploration and interactive analysis

Using Spark SQL running against data stored in Azure, companies can use BI tools such as Power BI, PowerApps, Flow, SAP Lumira, QlikView and Tableau to analyze and visualize their big data. Spark’s interactive analytics capability is fast enough to perform exploratory queries without sampling. By combining Spark with visualization tools, complex data sets can be processed and visualized interactively. These easy-to-use interfaces then allow even non-technical users to visually explore data, create models and share results. Because a wider audience can analyze big data without preconceived notions, companies can test new ideas and visualize important findings in their data earlier than ever before. Companies can identify new trends and relationships that were not apparent before and quickly drill down into them, ask new questions and find ways to innovate in smarter ways.

Scenario 2_Spark visual data exploration and interactive analysis

This scenario is even more powerful when interactive data discovery is combined with predictive analytics (more on this later in this blog). Based on relationships and trends identified during discovery, companies can use logistic regression or decision tree techniques to predict the probability of certain events in the future (e.g., customer churn probability). Companies can then take specific, targeted actions to control or avert certain events.

Scenario #3: Spark with NoSQL (HBase and Azure DocumentDB)

This scenario provides scalable and reliable Spark access to NoSQL data stored either in HBase or our blazing fast, planet-scale Azure DocumentDB, through “native” data access APIs. Apache HBase is an open-source NoSQL database that is built on Hadoop and modeled after Google BigTable. DocumentDB is a true schema-free managed NoSQL database service running in Azure designed for modern mobile, web, gaming, and IoT scenarios. DocumentDB ensures 99% of your reads are served under 10 milliseconds and 99% of your writes are served under 15 milliseconds. It also provides schema flexibility, and the ability to easily scale a database up and down on demand.

The Spark with NoSQL scenario enables ad-hoc, interactive queries on big data. NoSQL can be used for capturing data that is collected incrementally from various sources across the globe. This includes social analytics, time series, game or application telemetry, retail catalogs, up-to-date trends and counters, and audit log systems. Spark can then be used for running advanced analytics algorithms at scale on top of the data coming from NoSQL.

Scenario 3_Spark NoSQL

Companies can employ this scenario in online shopping recommendations, spam classifiers for real time communication applications, predictive analytics for personalization, and fraud detection models for mobile applications that need to make instant decisions to accept or reject a payment. I would also include in this category a broad group of applications that are really “next-gen” data warehousing, where large amounts of data need to be processed inexpensively and then served in an interactive form to many users globally. Finally, internet of things scenarios fit in here as well, with the obvious difference that the data represents the actions of machines instead of people.

Scenario #4: Spark with Data Lake

Spark on Azure can be configured to use Azure Data Lake Store (ADLS) as additional storage. ADLS is an enterprise-class, hyper-scale repository for big data analytic workloads. Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts in an enterprise environment to store data of any size, shape and speed, and do all types of processing and analytics across platforms and languages. Because ADLS is a file system compatible with the Hadoop Distributed File System (HDFS), it is very easy to combine with Spark for running computations at scale using pre-existing Spark queries.

Scenario 4_Spark with Data Lake

The data lake scenario arose because new types of data needed to be captured and exploited by companies, while still preserving all of the enterprise-level requirements like security, availability, compliance, failover, etc. The Spark with Data Lake scenario enables truly scalable advanced analytics on healthcare data, financial data, business-sensitive data, geo-location coordinates, clickstream data, server logs, social media, machine and sensor data. If companies want an easy way of building data pipelines, unparalleled performance, assured data quality, managed access control, change data capture (CDC) processing, seamless enterprise-level security, and world-class management and debugging tools, this is the scenario they need to implement.
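Because ADLS looks like HDFS to Spark, reading from it is just a path with the adl:// scheme. A minimal sketch follows; the account name, path, and column names are placeholders, and the cluster is assumed to already be configured with credentials for the Data Lake Store account:

```python
# Hedged sketch: reading clickstream CSVs from Azure Data Lake Store with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-sketch").getOrCreate()

clicks = spark.read.csv(
    "adl://contosodatalake.azuredatalakestore.net/clickstream/2016/08/*.csv",
    header=True, inferSchema=True)

# Pre-existing Spark queries run unchanged against the lake-resident data.
clicks.groupBy("page").count().orderBy("count", ascending=False).show(10)
```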

Scenario #5: Spark with SQL Data Warehouse

While there is still a lot of confusion, Spark and big data analytics are not a replacement for traditional data warehousing. Instead, Spark on Azure can complement and enhance a company’s data warehousing efforts by modernizing its approach to analytics. A data warehouse can be viewed as an ‘information archive’ that supports business intelligence (BI) users and reporting tools for the mission-critical functions of the company. My definition of mission-critical is any system that supports revenue generation or cost control; if such a system fails, companies would have to perform these tasks manually to prevent loss of revenue or increased cost. Big data analytics systems like Spark help augment such systems by running more sophisticated computations, smarter analytics and delivering deeper insights using larger and more diverse datasets.

Azure SQL Data Warehouse (SQLDW) is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on our massively parallel processing (MPP) architecture, SQLDW combines the power of the SQL Server relational database with Azure cloud scale-out capabilities. You can increase, decrease, pause, or resume a data warehouse in seconds with SQLDW. Furthermore, you save costs by scaling out CPU when you need it and cutting back usage during non-peak times. SQLDW is the manifestation of elastic future of data warehousing in the cloud.

Scenario 5_Spark with SQLDW

Some use cases of the Spark with SQLDW scenario include: using the data warehouse to get a better understanding of customers across product groups and then using Spark for predictive analytics on top of that data; or running advanced analytics with Spark on top of an enterprise data warehouse containing sales, marketing, store management, point of sale, customer loyalty, and supply chain data, to drive more informed business decisions at the corporate, regional, and store levels. Using Spark with data warehousing data, companies can do anything from risk modeling, to parallel processing of large graphs, to advanced analytics and text processing, all on top of their elastic data warehouse.
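As a rough illustration of the first use case, a warehouse table can be pulled into Spark over JDBC and analyzed there. The server, database, table, credentials, and column names below are placeholders, and the Microsoft SQL Server JDBC driver is assumed to be available on the cluster:

```python
# Hedged sketch: reading a SQL Data Warehouse table into Spark over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqldw-sketch").getOrCreate()

sales = (spark.read.format("jdbc")
         .option("url", "jdbc:sqlserver://mydwserver.database.windows.net:1433;"
                        "database=mydw;user=analyst@mydwserver;password=<password>")
         .option("dbtable", "dbo.FactSales")
         .load())

# Run the heavier analytics on the Spark side, e.g. sales per store.
sales.groupBy("StoreId").agg({"SalesAmount": "sum"}).show()
```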

Scenario #6: Machine Learning using R Server, MLlib

Another, and probably one of the most prominent, Spark use cases in Azure is machine learning. By storing datasets in-memory during a job, Spark has great performance for the iterative queries common in machine learning workloads. Common machine learning tasks that can be run with Spark in Azure include (but are not limited to) classification, regression, clustering, topic modeling, singular value decomposition (SVD), principal component analysis (PCA), hypothesis testing, and calculating sample statistics.

Typically, if you want to train a statistical model on very large amounts of data, you need three things:

  • Storage platform capable of holding all of the training data
  • Computational platform capable of efficiently performing the heavy-duty mathematical computations required
  • Statistical computing language with algorithms that can take advantage of the storage and computation power

Microsoft R Server, running on HDInsight with Apache Spark, provides all three. Microsoft R Server runs within the HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. With multi-threaded math libraries and transparent parallelization in R Server, customers can handle up to 1000x more data and up to 50x faster speeds than open source R. And if your data grows or you just need more power, you can dynamically add nodes to the Spark cluster using the Azure portal. Spark in Azure also includes MLlib for a variety of scalable machine learning algorithms, or you can use your own libraries. Some of the common applications of the machine learning scenario with Spark on Azure are listed in the table below.

Retail
  • Sales and Marketing: Demand forecasting; Loyalty programs; Cross-sell and upsell; Customer acquisition
  • Finance and Risk: Fraud detection; Pricing strategy
  • Customer and Channel: Personalization; Lifetime customer value; Product segmentation
  • Operations and Workforce: Store location demographics; Supply chain management; Inventory management

Financial Services
  • Sales and Marketing: Customer churn; Loyalty programs; Cross-sell and upsell; Customer acquisition
  • Finance and Risk: Fraud detection; Risk and compliance; Loan defaults
  • Customer and Channel: Personalization; Lifetime customer value
  • Operations and Workforce: Call center optimization; Pay for performance

Healthcare
  • Sales and Marketing: Marketing mix optimization; Patient acquisition
  • Finance and Risk: Fraud detection; Bill collection
  • Customer and Channel: Population health; Patient demographics
  • Operations and Workforce: Operational efficiency; Pay for performance

Manufacturing
  • Sales and Marketing: Demand forecasting; Marketing mix optimization
  • Finance and Risk: Pricing strategy; Perf risk management
  • Customer and Channel: Supply chain optimization; Personalization
  • Operations and Workforce: Remote monitoring; Predictive maintenance; Asset management

 

Scenario 6_Spark Machine Learning

Examples with just a few lines of code that you can try out right now:
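As one illustration (a sketch of my own, not the examples originally linked here), training a simple churn classifier with Spark ML takes only a few lines; the data and column names are placeholders:

```python
# Hedged sketch: logistic regression with Spark ML on a tiny in-memory dataset.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

data = spark.createDataFrame(
    [(34.0, 2, 0.0), (51.0, 7, 1.0), (23.0, 1, 0.0), (45.0, 9, 1.0)],
    ["age", "visits", "churned"])

# Assemble raw columns into a feature vector, then fit and score the model.
features = VectorAssembler(inputCols=["age", "visits"], outputCol="features")
train = features.transform(data)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(train)
model.transform(train).select("churned", "prediction", "probability").show()
```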

Scenario #7: Putting it all together in a notebook experience

For data scientists, we provide out-of-the-box integration with Jupyter (iPython), the most popular open source notebook in the world. Unlike other managed Spark offerings that might require you to install your own notebooks, we worked with the Jupyter OSS community to enhance the kernel to allow Spark execution through a REST endpoint.

We co-led “Project Livy” with Cloudera and other organizations to create an open source Apache licensed REST web service that makes Spark a more robust back-end for running interactive notebooks.  As a result, Jupyter notebooks are now accessible within HDInsight out-of-the-box. In this scenario, we can use all of the services in Azure mentioned above with Spark with a full notebook experience to author compelling narratives and create data science collaborative spaces. Jupyter is a multi-lingual REPL on steroids. Jupyter notebook provides a collection of tools for scientific computing using powerful interactive shells that combine code execution with the creation of a live computational document. These notebook files can contain arbitrary text, mathematical formulas, input code, results, graphics, videos and any other kind of media that a modern web browser is capable of displaying. So, whether you’re absolutely new to R or Python or SQL or do some serious parallel/technical computing, the Jupyter Notebook in Azure is a great choice.

Scenario 7_Spark with Notebook

You can also use Zeppelin notebooks on Spark clusters in Azure to run Spark jobs. Zeppelin notebook for HDInsight Spark cluster is an offering just to showcase how to use Zeppelin in an Azure HDInsight Spark environment. If you want to use notebooks to work with HDInsight Spark, I recommend that you use Jupyter notebooks. To make development on Spark easier, we support IntelliJ Spark Tooling which introduces native authoring support for Scala and Java, local testing, remote debugging, and the ability to submit Spark applications to the Azure cloud.

Scenario #8: Using Excel with Spark

As a final example, I wanted to describe the ability to connect Excel to a Spark cluster running in Azure using the Microsoft Open Database Connectivity (ODBC) Spark driver. Download it here.

Scenario 8_Spark with Excel

Excel is one of the most popular clients for data analytics on Microsoft platforms. In Excel, our primary BI tools such as PowerPivot, data-modeling tools, Power View, and other data-visualization tools are built right into the software, no additional downloads required. This enables users of all levels to do self-service BI using the familiar interface of Excel. Through a Spark add-in for Excel, users can easily analyze massive amounts of structured or unstructured data with a very familiar tool.

Conclusion

Above, I’ve described some of the amazing, game-changing scenarios for real-time big data processing with Spark on Azure. Any company across the globe, from a huge enterprise to a small startup can take their business to the next level with these scenarios and solutions. The question is, what are you waiting for?

Categories: Companies

Voron internals: Cleaning up scratch buffers

Ayende @ Rahien - Mon, 08/29/2016 - 10:00

In my previous post, I talked about how Voron achieves MVCC. Instead of modifying data in place, we copy the page or pages we want to modify to a scratch buffer and modify that. When the write transaction completes, we update the Page Translation Table so that any reference to the pages that were modified goes to the right place in the scratch file.

Note: Voron uses memory-mapped (mmap) files as scratch buffers; I use the terms scratch buffer and scratch file to refer to the same thing.

That is all well and good, and if you are familiar with how virtual memory works, this is exactly the model. In effect, every transaction gets a snapshot of the entire database as it was when the transaction was opened. Read transactions don’t modify the data and are guaranteed a stable snapshot of the database. The write transaction can modify the database freely, without worrying about locking or stepping over other transactions.

This is all pretty simple, and the sole cost we have when committing the transaction is flushing all the dirty pages to disk and then making an atomic pointer swap to update the Page Translation Table.
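In sketch form (Python standing in for Voron's actual C# implementation, with all names invented), the commit boils down to publishing a new page translation table while readers keep whatever table they started with:

```python
# Hedged sketch of MVCC via a Page Translation Table (PTT): readers keep the
# PTT snapshot they began with, while a commit publishes a new immutable PTT.
class Pager:
    def __init__(self):
        self.ptt = {}            # page number -> location in a scratch file

    def begin_read(self):
        return self.ptt          # snapshot: the published dict is never mutated in place

    def commit(self, dirty):     # dirty: page number -> new scratch location
        new_ptt = dict(self.ptt)
        new_ptt.update(dirty)
        self.ptt = new_ptt       # single reference assignment = the atomic pointer swap
```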

However, that is only part of the job. If all the data modifications happen in the scratch buffer, what happens to the scratch files over time?

Voron has a background process that monitors database activity, and based on certain policies (size, time, load factor, etc.) it will routinely write the data from the scratch files to the data file. This is a bit of an involved process, because we can’t just do this blindly.

Instead, we start by finding the oldest active transaction currently operating. We need to know that to make sure we aren’t overwriting any page that this transaction might visit (which would violate the transaction’s snapshot isolation). Once we have the oldest transaction, we gather all the pages from the Page Translation Table that came from older transactions and write them to the data file. There are a couple of tricks that we use here. It is very common for the same page to be modified multiple times (maybe we updated the record several times in different transactions), so we’ll have multiple copies of it. But we don’t actually need to copy all of them; we just need to copy the latest version (up to the oldest active transaction).
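A sketch of that selection logic, again with invented names, might look like this:

```python
# Hedged sketch: for every page modified by transactions older than the oldest
# active transaction, keep only the newest eligible version for copying.
def pages_to_flush(ptt_history, oldest_active_tx):
    """ptt_history: iterable of (tx_id, page_no, scratch_location) entries."""
    latest = {}
    for tx_id, page_no, location in ptt_history:
        if tx_id >= oldest_active_tx:
            continue                      # still visible to a live reader; skip for now
        if page_no not in latest or tx_id > latest[page_no][0]:
            latest[page_no] = (tx_id, location)
    return {page_no: loc for page_no, (tx, loc) in latest.items()}
```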

The process of copying all the data from the scratch file to the data file can happen concurrently with both read and write transactions. After the flush, we need to update the PTT again (so we open a very short write transaction to do that), and we are done. All the pages that we have copied from the scratch buffer are marked as free and are available for future transactions to use.

Note, however, that we haven’t called fsync on the data file yet. So even though we wrote to the data file, it was a buffered write, which is awesome for performance, but not so much for safety. This is done intentionally, for performance reasons. In my next post, I’ll talk about recovery and safety at length, so here I’ll just mention that we fsync the data file once a minute or once every 2GB or so. The idea is that we give the OS time to do the actual flush in the background before we jump in and demand that it happen.

Another problem that we have with the scratch buffer is that, like any memory allocation routine, it has to deal with fragmentation. We use a power-of-two allocator to reduce fragmentation as much as possible, but certain workloads can fragment the memory in such a way that it is hard or impossible to deal with. In order to handle that, we keep track of not just the free sections in the scratch buffer, but also the total amount of used memory. If a request cannot be satisfied by the scratch buffer because of fragmentation, even though there is enough free space available, we’ll create a new scratch file and use that as our new scratch. The old one will eventually be freed when all read transactions using it are over and all its data has been flushed away.
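For reference, the power-of-two sizing mentioned above is nothing more than rounding each request up to the next power of two (a sketch, not Voron's code):

```python
# Hedged sketch: rounding allocation requests up to the next power of two keeps
# the number of distinct block sizes small, which limits fragmentation.
def size_class(requested_pages: int) -> int:
    size = 1
    while size < requested_pages:
        size *= 2
    return size

assert [size_class(n) for n in (1, 3, 5, 9)] == [1, 4, 8, 16]
```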

Scratch files are marked as temporary and delete-on-close, so we don’t actually incur a high I/O cost when we create new ones, and it is typically only under a very heavy mixed read/write workload that we see the need to create new scratch files. This tends to be drastically cheaper than trying to do compaction, and it actually works in all cases, while compaction can fail in many.

You might have noticed an issue with the whole scheme: we can only move pages from the scratch file to the data file if they were modified by a transaction that is older than the oldest currently running transaction. That means a long-running read transaction can stall the entire process. This is typically only a problem when we see very heavy write usage combined with very long read transactions, which pushes the envelope on the size of the scratch buffer while at the same time not allowing us to clean it.

Indeed, when using Voron, you typically need to be aware of the need to close transactions in a reasonable timeframe. Within RavenDB, there are very few places where a transaction can span a long time (streaming is pretty much the only case in which we’ll allow it, and it is documented that a very long streaming request pushes memory usage on the server up, because we can’t clean up after the transaction). In practice, even transactions that take multiple minutes are fine under moderate write load, because there is enough capacity to handle them.

Categories: Blogs

Entity Framework - Basic Guide To Perform Fetch, Filter, Insert, Update And Delete

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn about basic guide to perform fetch, filter, insert, update and delete in Entity Framework.
Categories: Communities

AngularJS Filters

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, we will learn about AngularJS Filters.
Categories: Communities

How To Increase The Number Of Jump Lists In Windows 10

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn, how to increase the number of Jump Lists in Windows 10.
Categories: Communities

Hamburger Menu For Windows 10 UWP App Using Community Toolkit

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, we are going to see, how to use hamburger menu in Windows 10 UWP app, using UWP community toolkit.
Categories: Communities

How To Reset Your Network In Windows 10

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn, how to reset your network in Windows 10.
Categories: Communities

Bind Add Update Delete Data Using MVC Entity Framework And LINQ

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn how to add, update, and delete data using Entity Framework and LINQ. This is aimed at beginner MVC users like me.
Categories: Communities

Functions In Swift Programming Language

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn about functions in Swift programming language.
Categories: Communities

How To Run Xamarin Forms Application In Universal Windows Platform Development With XAML And C#

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn how to run Xamarin Forms application in UWP development with XAML and Visual C#.
Categories: Communities

Nested Kendo Grid Using Angular

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn about nested Kendo Grid, using Angular.
Categories: Communities

Deep Dive With CSS - Introduction

C-Sharpcorner - Latest Articles - Mon, 08/29/2016 - 08:00
In this article, you will learn how to work with CSS.
Categories: Communities