Data architect is responsible for finding solutions around data management. Large amount of data can be a big problem; if proper decisions are not taken around data. In this blog we will share some important pointers, and some sample questions for data architect exam.
Let ’s start with Large volumes of data
Data is one of the key elements of any application. Users constantly create data. All day long. Every day. So. Much. Data. Suddenly your org has accumulated millions of records, thousands of users, and several gigabytes of data storage.
These large data volumes (LDV) can lead to sluggish performance, including slower queries, slower search and list views, and slower sandbox refreshing. You can avoid this predicament if you plan for accommodating LDV up front, designing your data model to build scalability in from the get-go.
Large data can be managed in following ways-
- Avoid data skew-A key for managing large data volumes for peak performance is carefully architecting record ownership to avoid data skew. Data skew happens when more than 10,000 child records are associated with the same parent record within an org. Data skew can be of many types Ownership skew, Lookup Skew, Account data skew.
- Use External Data Objects-Another strategy for LDV is using external objects—which means there’s no need to bring data into Salesforce. With a data-tearing strategy that spreads data across multiple objects and brings it in on demand from another object or external store, you avoid both storing large amounts of data in your org, and the performance issues associated with LDV.
- Create efficient queries taking advantage of indexed fields. SOQL optimizer can be used to optimize queriesThe less data that your query returns, the better it is .Use indexed fields in where clause of query, we can request salesforce support team for custom indexesAvoid queries that require full table scan example- Querying for null rows—Queries that look for records in which the field is empty or null. For example: SELECT Id, Name FROM Account WHERE Custom_Field__c = null Negative filter operators—Using operators such as !=, NOT LIKE, or EXCLUDES in your queries. For example:SELECT CaseNumber FROM Case WHERE Status != ‘New’
Leading wildcards—Queries that use a leading wildcard, such as this:
SELECT Id, LastName, FirstName FROM Contact WHERE LastName LIKE ‘%smi%’
Text fields with comparison operators—Using comparison operators, such as >, <, >=, or <=, with text-based fields. For example:
SELECT AccountId, Amount FROM Opportunity WHERE Order_Number__c > 10
Query plan tool can suggest indexes, and gives cost of query
- Use batch Apex to query data- In general, the best way to query and process large data sets in the Force.com platform is to do it asynchronously in batches. You can query and process up to 50 million records using Batch Apex.
- Use skinny tables-Use skinny table if performance is not good enough even after using custom indexes- A skinny table
is a custom table in the Force.com platform that contains a subset of fields from a standard or custom base Salesforce object. Force.com can have multiple skinny tables if needed, and maintains them and keeps them completely
transparent to you.
- Use PK Chunking- PK Chunking is a supported feature of the Salesforce Bulk API. Now you can get the performance benefits of PK Chunking without doing all the work of splitting the queries into manageable chunks. You can simply enter a few parameters on your Bulk API job, and the platform will automatically split the query into separate chunks, execute a query for each chunk and return the data.
Primary Key Chunking
PK CHUNKING is a very important topic for this exam, PK Chunking can be used to Extract Large Data Sets from Salesforce. Primary Key Chunking helps in splitting queries into manageable chunks.
Some of the larger enterprise customers have recently been using a strategy we call PK Chunking to handle large data set extracts. PK stands for Primary Key — the object’s record ID — which is always indexed. With this method, customers first query the target table to identify a number of chunks of records with sequential IDs. They then submit separate queries to extract the data in each chunk, and finally combine the results.
With the arrival of the Spring ’15 release, primary key chunking is available in salesforce. This can be configured by adding few parameters on your Bulk API job, and the platform will automatically split the query into separate chunks, execute a query for each chunk and return the data.
Links to learn more about primary Key chunking
Bulk API is based on REST principles and is optimized for working with large sets of data. You can use it to insert, update, upsert, or delete many records asynchronously, meaning that you submit a request and come back for the results later. Salesforce processes the request in the background. Bulk api is asynchronous. Enabling the Bulk API in Data Loader
allows you to load or delete a large number of records faster than using the default SOAP-based API
Understanding Bulk API 1.0, and Bulk API 2.0 is very important.
|Bulk API 1.0||Bulk API 2.0|
Create Update, Delete and Query
Create, Update and Delete
prepare in batches
concept of batches
on a custom Rest framework
on a standard Rest framework
serial and parallel processing
What are skinny tables? What makes them fast?
- They avoid resource intensive joins
- Their tables are kept in sync with their source tables when source tables are modified
- They donot include soft deleted records
- Skinny help improve report and query performance in following ways-:
- Skinny tables provide a view across multiple objects for easy access to combined data
- Skinny tables contain frequently used fields and thereby help avoiding joins
- Skinny tables are kept in sync with changes to data in source tables
Understand report performance parameters. Performance of report depends on following
- Number of joins used in report query
- Number of records returned by report query
- Use new reports based on Analytics cloud when required
- Filters used in report should be indexed fields, as far as possible
Lets discuss Data quality
Causes of Bad data-:
- Missing Records
- Duplicate Records
- No Data Standards
- Incomplete Records
- Stale Data
Inaccurate or incomplete data can lead to 20% stalled productivity, which is one day of work each week. The average
company loses 12% of its revenue as a result of inaccurate data.Forty percent of all business initiatives fail to achieve their targeted benefits because oTo assess data quality of your org you can use App exchange
Apps like ‘Data Quality Analysis Dashboards App’
Measures to ensure good data quality
- Workflow Rules
Workflow rules are the magic wand in your Salesforce implementation act. Workflow rules let you automate standard internal procedures and processes to save time across your company. You set up workflow rules so that leads are routed to the nearest rep. You do the same to assign service requests, too. Now Gelato’s reps can focus their time on growing
business—not assigning records.
- Page Layouts
Some records have a zillion fields that you know your reps aren’t using. Ditch ’em! That’s right, you remove them
from the page layout for your reps. In fact, you create customized page layouts for different kinds of reps and managers across Gelato, to give them the fields they need when they need them. While you’re at it, you put the most important,
required fields at the top.
Why make your reps and managers wade through the swamp of reports and records? Instead, create simple dashboards to
support business objectives. For Gelato, you create a series of dashboards for managers across Gelato to show things like lead assignment and missing campaign data.
- Data Enrichment Tools
Data is obsolete almost as soon as it’s entered. That’s why it’s important to regularly match your data against a
trusted source. A number of products in Data Apps on AppExchange help you with this task.
- Duplicate Management
Duplicate records are the bane of any rep’s existence! Which record is the right record? You make sure there’s one
account record for each Gelato customer. Then you use Duplicate Management, Salesforce’s built-in duplicate management tools, to prevent duplicates from now on.
- Custom Field Types
You know the format your company wants to use for dates and currency, so you employ field types on custom fields. You
make sure to assign all custom date fields to Type = Date and all custom currency fields to Type = Currency. For fields that have a standard list of values, you use Type = Picklist. And, speaking of picklists, you set up State and Country Picklists. That way, your reps enter addresses by choosing from a standardized list of states and countries.
Install and configure Data.com Clean to monitor
Data.com Clean compares your account, contact, and lead records with records from Data.com and creates a link between
your records and matching Data.com records. Clean also provides clean status information for
accounts, contacts, and leads.
Basic difference between
lookup and master detail, custom setting and custom metadata may be asked in exam. We are not covering these topics in this blog,as these topics are easily available on internet.
Best Practices when importing large amount of data
- Defer sharing rules feature helps in suspending sharing rule for migration of large volumes of data. To suspend, resume, or recalculate sharing rule calculation: Consider deferring your sharing calculations before performing massive updates to sharing rules. When sharing is recalculated, Salesforce also runs all Apex sharing recalculations
- Remove all duplicates before importing data.
- Use right strategy, Bulk API 1.0 OR BULK API 2.0
- Deleting Data-The Salesforce data deletion mechanism can have a profound effect on the performance of large data volumes. Salesforce uses a Recycle Bin metaphor for data that users delete. Instead of removing the data, Salesforce flags the data as deleted and makes it visible through the Recycle Bin. This process is called soft deletion. While the data is soft deleted, it still affects database performance because the data is still resident, and deleted records have to be excluded from any queries.
Data archiving Strategy
Data archiving is the practice of moving data that's no longer being used to a separate storage device. Data backup expert and a senior consultant with Long View Systems Inc. Data archiving defines can be defined as "a single or a collection of historical records specifically selected for long-term retention and future reference." In addition, data archives consist of older data that is still important and necessary for future reference, as well as data that must be retained for regulatory compliance. Data archives are also indexed and have search capabilities so that files and parts of files can be easily located and retrieved.
If you want to learn more please check below recording
Thank you so much "sakshi nagpal" for a great blog post.