
Sovren Resume Parser

Sovren implements the HROpenStandards.org Resume schema for the transfer of information contained within resumes or CVs.

The Sovren Resume/CV Parser can output as JSON or XML. The XML conforms to the HROpenStandards.org Resume schema, and the JSON has the same object structure as the XML. In this guide, we discuss the schema and how to use XML and XPath to consume the output. If you are using the JSON output, you can ignore the XPath and XML namespaces discussion, as you will need to use a JSON library (such as Newtonsoft.Json) to consume the parser output.

All HROpenStandards.org schemas define optional UserArea elements that can contain any collection of data defined by users, companies, partners or vendors. Sovren enhances the HROpenStandards.org Resume format by inserting valuable data into these UserArea elements while retaining certifiable compatibility with the standard. The full schema and naming conventions are specified in the Resume.xsd schema and the SovrenResumeExtensions.xsd schema and can be downloaded here.

Getting Started

To quickly get up and running with the Sovren Resume Parser, follow the steps below. For a more in-depth explanation of parsing, read through the rest of the provided documentation.

1. Parse the sample doc

The first step is to simply parse the provided sample document (available here). You can use REST or SOAP, and make the API calls in any programming language. Take a look at our GitHub for a few examples.

2. Save the results to disk

This step will allow you to verify that you are handling the encoding correctly when you receive results from our API. Save the parse results to disk in UTF-8 and open the file. If you see no corrupted characters/data, everything is working correctly.
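
The check described in this step can be sketched in Python as follows. This is a minimal sketch, not part of the Sovren API: it assumes the parse result is already available as a string, and the function name and file name are hypothetical. It writes the text as UTF-8, reads it back, and looks for the Unicode replacement character (U+FFFD), which usually signals that bytes were decoded with the wrong encoding somewhere upstream.

```python
import tempfile
from pathlib import Path

def save_and_verify(parsed_text, path):
    """Write parser output as UTF-8 and confirm it round-trips cleanly.

    Hypothetical helper for illustration only.
    """
    target = Path(path)
    target.write_text(parsed_text, encoding="utf-8")
    reloaded = target.read_text(encoding="utf-8")
    # U+FFFD shows up when the text was decoded with the wrong encoding earlier.
    return reloaded == parsed_text and "\ufffd" not in reloaded

# Example: round-trip a small result that contains a non-ASCII character.
with tempfile.TemporaryDirectory() as folder:
    ok = save_and_verify("<GivenName>José</GivenName>", Path(folder) / "result.xml")
    print("encoding looks fine" if ok else "possible encoding problem")
```

Opening the saved file in an editor is still worthwhile, since an automated check cannot catch every kind of corruption.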

3. Determine what data you want to parse for

By default, our parser will return everything it can except for military/security clearance, speaking engagements/patents/publications, and training.

It is a mistake to use speaking engagements/patents/publications data in any other capacity than simply a blob of text. This data is too unreliable/inaccurate for any other purpose.

Similarly, training data is extremely difficult to parse for. Our parser has no way of knowing every possible training course, so don't rely on complete training data to be parsed properly.

4. What data should you store?

You need to decide what data you will need to store for your application. Some customers store all of the data that the parser outputs, while others only need to store a small portion. A good starting point is:

  • Contact Info
  • Employment History
  • Metadata

5. What data should you show?

Depending on whether your application is used by candidates, recruiters, or both, you will want to show different data to end users. Decide what is important to the success or usability of your application. We have one recommendation here: it is a mistake to show a large number of skills or jobs. Just because the parser outputs all of the data from the resume does not mean the end user finds that information valuable. Most likely, the user only cares to see the most recent jobs and the top skills.

6. What should candidates be able to review?

If your application involves candidates uploading their own resumes, you might want to consider letting the candidate review the parsed data and make changes; however, we do not recommend letting candidates review the skills/taxonomies. Letting a candidate edit this data would make it easy for someone to "game the system". Data that should typically be reviewed by candidates is contact information and the most recent job. Never make candidates review everything. Consider business rules that limit the data to be reviewed to a short subset of the overall data. For example, limit the review to the most recent three jobs, or to all jobs in the last 5 years.

If you need to parse resumes from countries with 4-digit postal codes, please contact Sovren support for guidance.

Configuration Basics

The Parser is highly configurable, but by default it is configured to parse in a way that meets most users' needs. No matter how the Parser is deployed, the Parser lives and dies for a single transaction. The SaaS service is stateless, so the Parser can be configured completely differently for each transaction.

How to Set Configuration Parameters

In version 8.0 we redesigned the parser configuration string (defined below) and turned it into a more readable, less error-prone Name=Value pair configuration string. To configure the parser, pass a string containing the settings and values for which you want to override the defaults. How to pass this to the API is detailed in the API documentation ( REST | SOAP ).

Generating the Parser Config String

Sovren provides an easy-to-use config string builder in the section below. You can paste in your existing config string, click apply, and the form inputs will be filled out to reflect your settings. Your new config string will be output in a box below the form inputs. It updates automatically as you make changes to the form inputs. If you don't have a config string, review the settings to determine any that you want to change.

Here is an example config string that would turn on parsing for Military History, Security Clearance, Patents, Publications, and Speaking Engagements:

"Coverage.MilitaryHistoryAndSecurityCredentials = true; Coverage.PatentsPublicationsAndSpeakingEvents = true"

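If you prefer to assemble the string in code, the Name=Value format shown in the example above can be sketched in Python as follows. The helper name is hypothetical, and the "; " separator and lowercase booleans are assumptions taken from the example string, not a documented grammar.

```python
def build_config_string(settings):
    """Join settings into 'Name = Value; Name = Value' form.

    Hypothetical helper; the separator and boolean casing mirror the
    example config string in this section and are assumptions.
    """
    parts = []
    for name, value in settings.items():
        # Write booleans in lowercase, as in the example above.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        parts.append(f"{name} = {text}")
    return "; ".join(parts)

config = build_config_string({
    "Coverage.MilitaryHistoryAndSecurityCredentials": True,
    "Coverage.PatentsPublicationsAndSpeakingEvents": True,
})
print(config)
```

Only the settings you want to override need to appear in the dictionary; anything omitted keeps its default.
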
Config String Builder

Use this tool to build your own custom configuration string. Once it's generated use this string for your parsing requests.

Step 1 - Prepopulate From Existing Config String

Existing Config String

Step 2 - Edit Individual Configuration Options

  • Optional sections to parse in addition to default sections
  • Where to search for skills
  • Also report these as skills
  • Parser Output
  • Culture
  • Additional settings for uncommon scenarios. Please consult with Sovren technical support before using the settings below. They will significantly impact parser results.
  • Assume Entry Level
  • Customize how skills are parsed (strongly not recommended for general use)
  • Force specific region (strongly not recommended for general use)

Step 3 - Generated Configuration String

Generated Config String

                
            

Date Format

The parser can report dates in one of two modes: ExplicitlyKnownDateInfoOnly or InferMissingDateParts. In the config string, set DateOutputStyle by adding one of the following:

"OutputFormat.DateOutputStyle = InferMissingDateParts;"
"OutputFormat.DateOutputStyle = ExplicitlyKnownDateInfoOnly;"

ExplicitlyKnownDateInfoOnly (default)

In this mode, the Parser outputs dates in one of the following formats:

<AnyDate>YYYY-MM-DD</AnyDate>
<AnyDate>notKnown</AnyDate>
<AnyDate>notApplicable</AnyDate>
<YearMonth>YYYY-MM</YearMonth>
<Year>YYYY</Year>
<StringDate>current</StringDate>

No assumptions are made about the missing date parts. For example:

Date in Resume     XML Output
Nov 2000           <YearMonth>2000-11</YearMonth>
11/28/2000         <AnyDate>2000-11-28</AnyDate>
1999               <Year>1999</Year>
Present            <StringDate>current</StringDate>
<not specified>    <AnyDate>notKnown</AnyDate>
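
Because a date in this mode can arrive in any of four elements, consuming code has to check which one is present. Here is a minimal Python sketch of that check. The StartDate and EndDate wrapper elements and the function name are assumptions for illustration, and the sample fragments omit XML namespaces; real parser output may declare one, in which case the tag lookups need a namespace mapping.

```python
import xml.etree.ElementTree as ET

def read_date(container):
    """Return (kind, value) for the first date child found in the container."""
    for tag in ("AnyDate", "YearMonth", "Year", "StringDate"):
        child = container.find(tag)
        if child is not None:
            return tag, (child.text or "").strip()
    return "missing", ""

# Hypothetical wrappers around the date elements listed above.
start = ET.fromstring("<StartDate><YearMonth>2000-11</YearMonth></StartDate>")
end = ET.fromstring("<EndDate><StringDate>current</StringDate></EndDate>")
print(read_date(start))  # ('YearMonth', '2000-11')
print(read_date(end))    # ('StringDate', 'current')
```

Returning the element name alongside the value lets the caller decide how much precision the date actually carries.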

InferMissingDateParts

In InferMissingDateParts mode, the Parser outputs all dates in this format:

<AnyDate>YYYY-MM-DD</AnyDate>
<AnyDate>notKnown</AnyDate>
<AnyDate>notApplicable</AnyDate>

This format is convenient for database storage and other environments where the stored date must include the Year, Month and Day. It also provides continuity in date ranges where the month and/or day are implied (such as 2005-2006).

Portions of the date that are not explicitly known are inferred. A job date of "1999" is inferred to mean the date range of 1999-01-01 to 1999-12-31. A job date of "March 1999" is inferred to mean the date range of 1999-03-01 to 1999-03-31.

In addition, references to the present, such as “2005 to Present” or “March 2005 until now”, are interpreted as being the same as the RevisionDate (see below).
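
The inference rule described above can be illustrated with a short Python sketch. This is not the Parser's implementation; it only reproduces the stated behavior for year-only and year-month values, and the function name is hypothetical.

```python
import calendar
from datetime import date

def infer_range(value):
    """Expand 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' into (first day, last day)."""
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:
        # Year only: the whole calendar year.
        return date(parts[0], 1, 1), date(parts[0], 12, 31)
    if len(parts) == 2:
        # Year and month: first to last day of that month.
        last_day = calendar.monthrange(parts[0], parts[1])[1]
        return date(parts[0], parts[1], 1), date(parts[0], parts[1], last_day)
    exact = date(*parts)
    return exact, exact

print(infer_range("1999"))     # 1999-01-01 to 1999-12-31
print(infer_range("1999-03"))  # 1999-03-01 to 1999-03-31
```

The same expansion is useful on the consuming side if you store ExplicitlyKnownDateInfoOnly output and later need full dates for range queries.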

RevisionDate

The /Resume/StructuredXMLResume/RevisionDate element is unaffected by these settings and is always output in this format:

<RevisionDate>YYYY-MM-DD</RevisionDate>

The RevisionDate is automatically assigned to the local date of the machine that is doing the parsing. The RevisionDate value is used whenever relative dates are used, such as the term “current” in the date range “May 2006 to Current” or when calculating whether a skill is recent enough to warrant additional weighting in best-fit taxonomy calculations.

Additional Remarks

The notApplicable value is never actually output as a result of parsing a resume/CV. However, the Parser will pass through that value if it exists in an HROpenStandards.org Resume file that was produced by some other means, loaded into memory (i.e. hrxmlResume.LoadXml), and re-written to XML.

Entry Level

Parsing Resumes from Students, Recent Graduates, and Low-Education Workers

The Parser assumes that all resumes contain Employment History and Education, and when confronted with a resume that seems to be missing Employment History or Education, it will assume that it has made a mistake, missed that data, and will try to treat other data as Employment History or Education.

Although that's a good strategy, it fails for student/new graduate/undereducated worker resumes where it is probable that their resume really does not contain any Employment History or perhaps no Education. Therefore, when parsing a resume from a student or recent graduate or a worker with no advanced education (i.e., not even high school), set "Coverage.EntryLevel = true;" in the config string (the default is false). This will tell the Parser that it's acceptable to not find Employment History, and will result in more accurate parsing for student/recent graduate resumes only.

Parser Output

Default Sections

By default, the resume parser will output the following sections:

Section Type                Description                Configuration Options
Contact Info                The Contact Info section represents all contact related information such as name, phone number, address, etc.
Objective                   Job Objective that was found in the resume
Position History            A list of all of the positions held by the candidate including employer, dates, descriptions, and a user area with metadata
Education                   A list of all education related information including school type, school name, degree type, etc...
Licenses & Certifications   A list of all certifications and licenses found in the resume
Skills                      A list of all of the skills found in the enabled sections of the resume. Output includes skill name, total months of use, last used date, where it was found in the document, and information about the taxonomy.
Languages                   Includes information about the language the document was written in as well as the languages that a candidate can write/speak/read
Personal Information        This section includes date of birth, gender, mother tongue, nationality, visa, etc...
Training                    A list of trainings specified in the resume
Achievements                A list of the achievements specified by the candidate
Associations                A list of the associations specified by the candidate along with their role
References                  A list of references specified in the resume including contact info if specified
Hobbies                     Outputs the text found pertaining to hobbies

Optional Sections

By default, the resume parser won't output the following sections, but they can be enabled with a configuration setting that's documented in the configuration options link:

Section Type            Description                Configuration Options
Patents                 A list of patents specified in the document
Publications            A list of publications specified in the document
Speaking Engagements    A list of speaking engagements specified in the document
Security Credentials    A list of security credentials specified in the document
Military History        A list of military history specified in the document

Contact Info

Contact Methods

Each ContactMethod element allows one of each of the following sub-elements:

  • Use
  • Location
  • WhenAvailable
  • Telephone
  • Mobile
  • Fax
  • Pager
  • TTYTDD
  • InternetEmailAddress
  • InternetWebAddress
  • PostalAddress

If a resume contains more than one of the same type of these items, such as two Telephone numbers, then they must be reported in a separate ContactMethod object. For example:

<ContactMethod>
    <Use>personal</Use>
    <Location>home</Location>
    <Telephone>
        <FormattedNumber>(858) 555-6553</FormattedNumber>
    </Telephone>
</ContactMethod>
<ContactMethod>
    <Use>work</Use>
    <Location>office</Location>
    <Telephone>
        <FormattedNumber>(858) 555-8281</FormattedNumber>
    </Telephone>
</ContactMethod>
<ContactMethod>
    <Use>personal</Use>
    <Location>home</Location>
    <InternetEmailAddress>missmadams@yahoo.com</InternetEmailAddress>
</ContactMethod>
<ContactMethod>
    <Use>twitterHandle</Use>
    <Location>home</Location>
    <WhenAvailable>anytime</WhenAvailable>
    <InternetWebAddress>@twitQueen</InternetWebAddress>
</ContactMethod>
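
Since each ContactMethod carries at most one item of each type, collecting all phone numbers or email addresses means iterating over every ContactMethod. Here is a minimal Python sketch under a few assumptions: the ContactInfo wrapper is added only so that the fragment parses as a single document, the function name is hypothetical, and the sample omits XML namespaces (real parser output may declare one).

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<ContactInfo>
  <ContactMethod>
    <Use>personal</Use>
    <Location>home</Location>
    <Telephone><FormattedNumber>(858) 555-6553</FormattedNumber></Telephone>
  </ContactMethod>
  <ContactMethod>
    <Use>work</Use>
    <Location>office</Location>
    <Telephone><FormattedNumber>(858) 555-8281</FormattedNumber></Telephone>
  </ContactMethod>
  <ContactMethod>
    <Use>personal</Use>
    <Location>home</Location>
    <InternetEmailAddress>missmadams@yahoo.com</InternetEmailAddress>
  </ContactMethod>
</ContactInfo>
"""

def collect_contacts(root):
    """Gather (use, value) pairs for phone numbers and email addresses."""
    phones, emails = [], []
    for method in root.iter("ContactMethod"):
        use = method.findtext("Use", default="")
        number = method.findtext("Telephone/FormattedNumber")
        if number:
            phones.append((use, number))
        email = method.findtext("InternetEmailAddress")
        if email:
            emails.append((use, email))
    return phones, emails

phones, emails = collect_contacts(ET.fromstring(SAMPLE))
print(phones)
print(emails)
```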

Phone Numbers

The HROpenStandards.org standard for phone numbers takes one of two forms: Formatted or Structured. Unfortunately, a single number cannot be represented in both forms, so you must choose which to use. By default, the Parser only outputs FormattedNumber elements.

Sovren provides the config string setting OutputFormat.TelcomNumber.Style to control the phone number output format. This setting accepts the following values:

"OutputFormat.TelcomNumber.Style = Raw;"
"OutputFormat.TelcomNumber.Style = Formatted;"
"OutputFormat.TelcomNumber.Style = Structured;"

Raw (default)

Output the number in a FormattedNumber element exactly as it appeared in the original document.

<Telephone>
    <FormattedNumber>+44 76889-8876</FormattedNumber>
</Telephone>

Formatted

Output the number in a FormattedNumber element in a normalized format, if possible; otherwise fall back to Raw. US/Canadian phone numbers are normalized to this format: (NNN) NNN-NNNN, or (NNN) NNN-NNNN x NNN when an extension is included.

<Telephone>
    <FormattedNumber>(678) 555-1212 x 180</FormattedNumber>
</Telephone>

Structured

Output in the multi-element structured format, if possible; otherwise fall back to Formatted.

<Telephone>
    <InternationalCountryCode>1</InternationalCountryCode>
    <NationalNumber></NationalNumber>
    <AreaCityCode>678</AreaCityCode>
    <SubscriberNumber>555-1212</SubscriberNumber>
    <Extension>180</Extension>
</Telephone>

The Formatted and Structured settings currently only apply to US/Canadian numbers. Due to the hugely varied colloquial formats of phone numbers in other countries, we have been unable to reliably normalize the number parts. As a consequence, even if you set the style to Structured, you will still get some FormattedNumber elements in the XML, so your code will need to handle both cases.
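
Handling both cases can be sketched in Python as follows. This is a minimal sketch with a hypothetical function name: it prefers FormattedNumber when present and otherwise joins the structured parts. The display format it produces for structured numbers is an illustrative choice, not something the Parser defines, and the sample fragments omit XML namespaces.

```python
import xml.etree.ElementTree as ET

def phone_to_text(telephone):
    """Return a display string for a Telephone element in either output form."""
    formatted = telephone.findtext("FormattedNumber")
    if formatted:
        return formatted.strip()

    def part(tag):
        # Missing and empty elements are both treated as an empty string.
        return (telephone.findtext(tag) or "").strip()

    country, area = part("InternationalCountryCode"), part("AreaCityCode")
    pieces = [f"+{country}" if country else "",
              f"({area})" if area else "",
              part("SubscriberNumber")]
    text = " ".join(p for p in pieces if p)
    extension = part("Extension")
    return f"{text} x {extension}" if extension else text

structured = ET.fromstring(
    "<Telephone><InternationalCountryCode>1</InternationalCountryCode>"
    "<NationalNumber></NationalNumber><AreaCityCode>678</AreaCityCode>"
    "<SubscriberNumber>555-1212</SubscriberNumber>"
    "<Extension>180</Extension></Telephone>"
)
raw = ET.fromstring(
    "<Telephone><FormattedNumber>+44 76889-8876</FormattedNumber></Telephone>"
)
print(phone_to_text(structured))  # +1 (678) 555-1212 x 180
print(phone_to_text(raw))         # +44 76889-8876
```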

Normalize Region

By default, the Parser reports the Region as it was detected in the document. When this setting is turned on ( "OutputFormat.NormalizeRegions = True;"), the parser normalizes Region values to the standard postal abbreviations. For example, 'Texas' to 'TX'. This setting currently only applies to US states and Canadian provinces.

Position History

Job Categories

The following type of output is always generated for each PositionHistory element:

<JobCategory>
    <TaxonomyName>Skills taxonomy</TaxonomyName>
    <CategoryCode>Information Technology (10) → Internet (196)</CategoryCode>
    <Comments>Information Technology describes 50% of this job</Comments>
</JobCategory>
<JobCategory>
    <TaxonomyName>Job Level</TaxonomyName>
    <CategoryCode>Experienced (non-manager)</CategoryCode>
</JobCategory> 

For Job Level, the CategoryCode is one of the following values, based on the length of experience and job titles:

  • Entry Level
  • Experienced (non-manager)
  • Senior (more than 5 years experience)
  • Manager
  • Senior Manager (more than 5 years management experience)
  • Executive (VP, Dept. Head)
  • Senior Executive (President, C-level)

Stripping Out Reported Data from Jobs

By default, the PositionHistory/Description element includes the descriptive text related to a particular PositionHistory element, excluding the portion that contains the title, company, location and date. If you want the Description element to have all of the text associated with a position, including the parsed data points, then set OutputFormat.StripParsedDataFromPositionHistoryDescription to false.

For example, the default behavior strips this text from the “Description” node:

    Technical Difference        October 2004 - Current
    Director of Web Applications Development

While this works well for most resumes, it can cause problems with some resumes that do not have all the data points together. Some data may be buried far away from other data, or at the end of the description, and in such cases, more data will be stripped out than expected, leaving an incomplete Description.

Strip Parsed Data

"OutputFormat.StripParsedDataFromPositionHistoryDescription = true;" (default value)

<PositionHistory positionType="directHire" currentEmployer="true">
    <Title>Director of Web Applications Development</Title>
    <OrgName>
        <OrganizationName>Technical Difference</OrganizationName>
    </OrgName>
    <Description>• Managed email marketing campaigns to attract new sales and retain customers.
    • Convert current HRIS from VB to ASP to create complete web based solution.
    • Added custom encryption coding to SQL and ASP web applications.
    • Designed custom applicant tracking ASP program for large client.</Description>
    ...
</PositionHistory>

Include Parsed Data

"OutputFormat.StripParsedDataFromPositionHistoryDescription = false;"

<PositionHistory positionType="directHire" currentEmployer="true">
    <Title>Director of Web Applications Development</Title>
    <OrgName>
        <OrganizationName>Technical Difference</OrganizationName>
    </OrgName>
    <Description>Technical Difference October 2004 - Current
    Director of Web Applications Development
    • Managed email marketing campaigns to attract new sales and retain customers.
    • Convert current HRIS from VB to ASP to create complete web based solution.
    • Added custom encryption coding to SQL and ASP web applications.
    • Designed custom applicant tracking ASP program for large client.</Description>
    ...
</PositionHistory>

Reformat PositionHistory Description

By default, the PositionHistory/Description element retains as much of the original formatting as possible. For example:

<Description>• Designed &amp; developed web-based gift fulfillment system for use by Mazda and their affiliates.
• Designed &amp; developed web-based event registration systems for Mercedes and Volvo.
• Primary contact for Automotive Ride &amp; Drive Marketing Campaigns.</Description>

When this setting is enabled ("OutputFormat.ReformatPositionHistoryDescription = True;"), the Parser removes blank lines, splits long paragraphs into separate lines, and applies other reformatting techniques intended to place each achievement on a separate line. Example:

<Description>Designed &amp; developed web-based gift fulfillment system for use by Mazda and their affiliates.

Designed &amp; developed web-based event registration systems for Mercedes and Volvo.

Primary contact for Automotive Ride &amp; Drive Marketing Campaigns.</Description>

Prefer Shorter Position Titles

By default, this setting is turned off and the parser reports position titles exactly as they are found in the document. When true ("OutputFormat.PreferShorterPositionTitles = True;"), titles may be truncated if the additional phrase does not include Job words. For example, VICE PRESIDENT, INFORMATION SYSTEMS would be reported as just VICE PRESIDENT if this switch is set to true.

Position History User Area

The HROpenStandards.org Resume standard defines UserArea elements throughout the schema that can be populated with additional information that is not part of the standard. Sovren inserts significant value-added information in these UserAreas. These sections are documented in this document and defined in the SovrenResumeExtensions.xsd file.

The UserArea content for PositionHistory elements is located at /Resume/StructuredXMLResume/EmployerOrg/PositionHistory/UserArea/sov:PositionHistoryUserArea. This is what a typical PositionHistoryUserArea element looks like:

<sov:PositionHistoryUserArea>
    <sov:Id>POS-1</sov:Id>
    <sov:CompanyNameProbabilityInterpretation>Confident</sov:CompanyNameProbabilityInterpretation>
    <sov:PositionTitleProbabilityInterpretation>Confident</sov:PositionTitleProbabilityInterpretation>
    <sov:IsSelfEmployed>true</sov:IsSelfEmployed>
    <sov:SelfEmploymentPhrase>Self-employed</sov:SelfEmploymentPhrase>
    <sov:NumberOfEmployeesSupervised>15</sov:NumberOfEmployeesSupervised>
    <sov:NormalizedOrganizationName>Egypt Air</sov:NormalizedOrganizationName>
    <sov:NormalizedTitle>Accounting Manager General Manager</sov:NormalizedTitle>
    <sov:Subtitles>
        <sov:Subtitle Value="Subtitle">Accounting Manager</sov:Subtitle>
        <sov:Subtitle Value="Subtitle">General Manager</sov:Subtitle>
    </sov:Subtitles>
    <sov:Bullets>
        <sov:Bullet type="creativeTerm">Designed &amp; developed web-based gift fulfillment system for use by Mazda and their affiliates to intake consumer information from bulk mail sends, accept orders from consumers who completed their incentive program, and to report on all activity</sov:Bullet>
        <sov:Bullet type="creativeTerm">Designed &amp; developed web-based event registration systems for Mercedes and Volvo including comprehensive reporting on system activity</sov:Bullet>
        <sov:Bullet type="sentence">Primary contact for Automotive Ride &amp; Drive Marketing Campaigns</sov:Bullet>
    </sov:Bullets>
</sov:PositionHistoryUserArea>

Id

Id is a unique identifier assigned to each PositionHistory. Competency elements list the identifier of each PositionHistory element they were found within. The format of the identifier is POS-#, where # is a number that starts at 1 for the first PositionHistory and increments by 1 for each subsequent PositionHistory.

CompanyNameProbabilityInterpretation

CompanyNameProbabilityInterpretation represents the degree of certainty that the OrganizationName element value is accurate. The following scale is used:

Value           Recommended Actions
VeryUnlikely    Recommend Discarding
Unlikely        Recommend Discarding
Probable        Recommend Review
Confident       No Action Needed

The Parser only reports names having a probability of 'Probable' or 'Confident', thus if the CompanyNameProbabilityInterpretation is 'Unlikely' or 'VeryUnlikely', then the OrganizationName will not be reported.
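
One way to act on this scale in consuming code is a simple lookup, sketched here in Python. The action labels and the function name are hypothetical, and treating an unrecognized value as needing review is a design choice of this sketch rather than documented Parser behavior.

```python
# Maps each interpretation value to the action recommended in the table above.
RECOMMENDED_ACTION = {
    "VeryUnlikely": "discard",
    "Unlikely": "discard",
    "Probable": "review",
    "Confident": "accept",
}

def needs_review(interpretation):
    """Return True when a human should look at the value."""
    # Unknown values are routed to review instead of being trusted silently.
    return RECOMMENDED_ACTION.get(interpretation, "review") == "review"

print(needs_review("Probable"))   # True
print(needs_review("Confident"))  # False
```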

PositionTitleProbabilityInterpretation

PositionTitleProbabilityInterpretation represents the degree of certainty that the Title element value is accurate. This value uses the same scale described above for CompanyNameProbabilityInterpretation.

IsSelfEmployed

IsSelfEmployed is true when this is a self-employed position; otherwise it is false.

SelfEmploymentPhrase

When IsSelfEmployed is true, SelfEmploymentPhrase contains the exact text from the resume that indicates this is a self-employed position.

NumberOfEmployeesSupervised

NumberOfEmployeesSupervised is the number of employees that the candidate supervised in this position.

NormalizedOrganizationName

The normalized OrganizationName.

NormalizedTitle

The normalized PositionTitle.

Subtitles

Any number of subtitles that could be used to categorize the position title. These are useful for grouping positions that have similar titles into buckets for searching and matching.

Bullets

When "OutputFormat.CreateBullets = true;" in the config string, the UserArea will include a "bullet" based interpretation of the Description text in which each significant sentence/line/paragraph is reported as a separate sov:Bullet element. This can be useful when transforming the output into a standard resume document format and you want each major point to be a bullet.

The type attribute of each sov:Bullet element is one of the following values:

  • creativeTerm: Bullet text contains one of the phrases from the CREATIVE_ACTION_WORDS data list (such as “implemented”, “initiated”, and “developer on”).
  • sentence: This is the default when the type is not creativeTerm.

Here is an example of the output with this feature turned on:

<sov:PositionHistoryUserArea>
  <sov:Id>POS-3</sov:Id>
  <sov:CompanyNameProbability>20</sov:CompanyNameProbability>
  <sov:PositionTitleProbability>21</sov:PositionTitleProbability>
  <sov:Bullets>
    <sov:Bullet type="creativeTerm">Designed &amp; developed web-based gift fulfillment system</sov:Bullet>
    <sov:Bullet type="creativeTerm">Designed &amp; developed web-based registration systems for Mercedes</sov:Bullet>
    <sov:Bullet type="sentence">Primary contact for Automotive Ride &amp; Drive Marketing Campaigns</sov:Bullet>
  </sov:Bullets>
</sov:PositionHistoryUserArea>
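
Reading the bullets requires a namespace-aware lookup because the elements use the sov prefix. Here is a minimal Python sketch. The namespace URI below is an assumption used only to make the sample self-contained; use whatever URI the sov prefix is bound to in your actual output. The function name is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for illustration; check the xmlns:sov declaration
# in your own parser output and substitute it here.
SOV_NS = "http://sovren.com/hr-xml/2006-02-28"

SAMPLE = f"""
<sov:PositionHistoryUserArea xmlns:sov="{SOV_NS}">
  <sov:Id>POS-3</sov:Id>
  <sov:Bullets>
    <sov:Bullet type="creativeTerm">Designed &amp; developed web-based gift fulfillment system</sov:Bullet>
    <sov:Bullet type="sentence">Primary contact for Automotive Ride &amp; Drive Marketing Campaigns</sov:Bullet>
  </sov:Bullets>
</sov:PositionHistoryUserArea>
"""

def read_bullets(user_area):
    """Return (type, text) pairs for each sov:Bullet under the user area."""
    ns = {"sov": SOV_NS}
    return [
        (bullet.get("type"), (bullet.text or "").strip())
        for bullet in user_area.findall("sov:Bullets/sov:Bullet", ns)
    ]

for kind, text in read_bullets(ET.fromstring(SAMPLE)):
    print(f"{kind}: {text}")
```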

Education

There are no configuration options for this section type. Here is an explanation of the output.

Degrees

In the HROpenStandards.org output, the Parser reports the level of education in the degreeType attribute of the Degree element: <Degree degreeType="bachelors">

These values are not very global-friendly, but the Parser does normalize all degrees to one of these pre-defined degreeTypes. This list is sorted, as well as possible, by increasing level of education, although there are certainly ambiguities from one discipline to another, such as whether professional is above or below masters. Here are the possible values:

  • specialeducation
  • some high school or equivalent
  • ged
  • secondary
  • high school or equivalent
  • certification
  • vocational
  • some college
  • HND/HNC or equivalent
  • associates
  • international
  • bachelors
  • some post-graduate
  • masters
  • intermediategraduate
  • professional
  • postprofessional
  • doctorate
  • postdoctorate
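
Because the list above is ordered by increasing level of education, its position can serve as a rough rank. Here is a minimal Python sketch that picks the highest degreeType from a set of values. The function name is hypothetical, and the ordering inherits the ambiguities noted above (for example, professional versus masters).

```python
# degreeType values in the order given above (lowest to highest).
DEGREE_TYPES = [
    "specialeducation", "some high school or equivalent", "ged", "secondary",
    "high school or equivalent", "certification", "vocational", "some college",
    "HND/HNC or equivalent", "associates", "international", "bachelors",
    "some post-graduate", "masters", "intermediategraduate", "professional",
    "postprofessional", "doctorate", "postdoctorate",
]
RANK = {name: index for index, name in enumerate(DEGREE_TYPES)}

def highest_degree(degree_types):
    """Return the highest-ranked known degreeType, or None if none is known."""
    known = [d for d in degree_types if d in RANK]
    return max(known, key=RANK.__getitem__) if known else None

print(highest_degree(["bachelors", "masters", "some college"]))  # masters
```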

School Types

The HROpenStandards.org enumeration schoolTypes is used for values in the schoolType attribute of the SchoolOrInstitution element like this:

<EducationHistory>
    <SchoolOrInstitution schoolType="lowerSchool">
        <School>
            <SchoolName>St John The Baptist Primary School</SchoolName>
        </School>
        ...

The HROpenStandards.org standard defines these values:

  • highschool
  • secondary
  • trade
  • community
  • college
  • university

The Sovren Resume Parser also defines these values:

Value           Meaning
UNSPECIFIED     The HROpenStandards.org schema requires a value for schoolType, but the value is not always known. This value is reported when the school type is not known.
lowerSchool     Anything below high school
professional    Continuing education for professional careers
vocational      Vocational schools

Degree User Area

The HROpenStandards.org Resume standard defines UserArea elements throughout the schema that can be populated with additional information that is not part of the standard. Sovren inserts significant value-added information in these UserAreas. These sections are documented in this document and defined in the SovrenResumeExtensions.xsd file.

The UserArea content for Degree elements is located at /Resume/StructuredXMLResume/EducationHistory/Degree/UserArea/sov:DegreeUserArea. This is what a typical DegreeUserArea element looks like:

<sov:DegreeUserArea>
    <sov:Id>DEG-1</sov:Id>
    <sov:Graduated>false</sov:Graduated>
    <sov:NormalizedGPA>0.85</sov:NormalizedGPA>
</sov:DegreeUserArea>

Id

Id is a unique identifier assigned to each Degree. Competency elements list the identifier of each Degree element they were found within. The format of the identifier is DEG-#, where # is a number that starts at 1 for the first Degree and increments by 1 for each subsequent Degree.

Graduated

Graduated is a Boolean value that indicates whether the degree was completed. It is not always safe to assume that just because a degree is listed it was completed, and there is usually not enough information to determine graduation status from the resume itself, but some candidates do report that they didn’t finish (or haven’t yet finished) the degree. Possible values:

  • Element is not output: Indicates that the Parser has no information.
  • false: Indicates that the degree was not completed or the candidate is still pursuing the degree.
  • true: Indicates that the degree was completed.

NormalizedGPA

NormalizedGPA is a decimal value that is output only when a GPA has been provided. This value is normalized from 0.0 to 1.0, with 1.0 being the top mark, so that all GPAs across all scales can be compared, taking into account different min/max values and whether high or low numbers are ranked higher. For example:

  • USA degree with GPA of 3.5 / 4.0 = 0.875
  • German degree with 1.5 / 6.0 = 0.916

Licenses & Certifications

There are no configuration options for this section type. Here is an explanation of the output.

Licenses and certifications are reported in LicenseOrCertification elements found within /Resume/StructuredXMLResume/LicensesAndCertifications.

<LicensesAndCertifications>
    <LicenseOrCertification>
        <Name>Project Management Professional</Name>
        <Description>certification; matched to list</Description>
        <EffectiveDate>
            <FirstIssuedDate>
                <YearMonth>2001-09</YearMonth>
            </FirstIssuedDate>
        </EffectiveDate>
    </LicenseOrCertification>
</LicensesAndCertifications>

Name

The name or phrase that describes the license or certification. This value is not standardized or mapped to any pre-defined list.

Description

This element reports additional information about the license or certification. It is one of the following values, where the text in square brackets is conditionally output depending on the context:

  • license[; found in LICENSES][; matched to list]
  • certification[; found in CERTIFICATIONS][; matched to list]

The “found in LICENSES” note indicates that the license was found when parsing the text of a LICENSES section.

The “found in CERTIFICATIONS” note indicates that the certification was found when parsing the text of a CERTIFICATIONS section.

The “matched to list” note indicates that the license or certification was found anywhere within the text of the resume/CV based on matching a specific keyword, key phrase, or pattern as defined in one of the Parser’s data lists.
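
The Description value can be split into its parts with a few lines of Python. This is a minimal sketch with a hypothetical function name and hypothetical result keys; it assumes the notes are separated by semicolons exactly as shown in the patterns above.

```python
def parse_cert_description(description):
    """Split a LicenseOrCertification Description into kind and notes."""
    parts = [p.strip() for p in description.split(";")]
    notes = parts[1:]
    return {
        "kind": parts[0],  # "license" or "certification"
        "found_in_section": any(n.startswith("found in ") for n in notes),
        "matched_to_list": "matched to list" in notes,
    }

print(parse_cert_description("certification; matched to list"))
```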

EffectiveDate.FirstIssuedDate

The date of the license or certification, if any.

EffectiveDate.ValidFrom & EffectiveDate.ValidTo

The effective date range, if any.

Skills

Where To Look For Skills

By default, the parser looks in the following sections for skills:

Section Type         Config String To Turn Section Off
Achievements         "Coverage.FindSkillsInAchievements = False;"
Certifications       "Coverage.FindSkillsInCertifications = False;"
Cover Letter         "Coverage.FindSkillsInCoverLetter = False;"
Education            "Coverage.FindSkillsInEducationHistory = False;"
Executive Summary    "Coverage.FindSkillsInExecutiveSummary = False;"
Languages            "Coverage.FindSkillsInLanguages = False;"
Licenses             "Coverage.FindSkillsInLicenses = False;"

Also Report These As Skills

By default, the parser doesn't report any of these data types as skills. To report any of them as skills, use the corresponding config string value in the table.

Section Type                 Config String To Report Data Type as Skill
Position Titles              "Coverage.AddPositionTitlesToSkills = True;"
Languages                    "Coverage.AddLanguagesToSkills = True;"
Licenses & Certifications    "Coverage.AddCertificationsAndLicensesToSkills = True;"

Skills Taxonomy Output

This section contains the skill/competency data in the Sovren-preferred format. You may prefer to consume this data rather than the data in the Competencies section or use a combination of both. Note that both sections contain the same data, only the format is different.

<sov:SkillsTaxonomyOutput>
    <sov:TaxonomyRoot name="Sovren">
        <sov:Taxonomy name="Information Technology" id="10" percentOfOverall="57">
            <sov:Subtaxonomy name="Database" id="193" percentOfOverall="33" percentOfParentTaxonomy="58">
                <sov:Skill name="DATABASES" existsInText="true" whereFound="Found in SKILLS" childrenTotalMonths="21" childrenLastUsed="2004-12-01">
                    <sov:ChildSkill name="DATABASE" existsInText="true" totalMonths="21" lastUsed="2004-12-01" whereFound="Found in SUMMARY; WORK HISTORY; POS-2"></sov:ChildSkill>
                </sov:Skill>
                <sov:Skill name="MICROSOFT ACCESS" existsInText="false">
                    <sov:ChildSkill name="ACCESS 97" existsInText="true" whereFound="Found in SKILLS"></sov:ChildSkill>
                </sov:Skill>
                <sov:Skill name="MS SQL SERVER" existsInText="false" childrenTotalMonths="45" childrenLastUsed="2004-12-01">
                    <sov:ChildSkill name="SQL SERVER" existsInText="true" totalMonths="45" lastUsed="2004-12-01" whereFound="Found in SKILLS; SUMMARY; WORK HISTORY; POS-2; POS-4"></sov:ChildSkill>
                </sov:Skill>
                <sov:Skill name="ORACLE" existsInText="true" whereFound="Found in SKILLS"></sov:Skill>
                <sov:Skill name="SQL" existsInText="true" totalMonths="194" lastUsed="2017-05-02" whereFound="Found in SKILLS; SUMMARY; WORK HISTORY; POS-1; POS-2; POS-4"></sov:Skill>
            </sov:Subtaxonomy>
            <sov:Subtaxonomy name="Internet" id="196" percentOfOverall="10" percentOfParentTaxonomy="18">
                <sov:Skill name="ASP" existsInText="true" totalMonths="151" lastUsed="2017-05-02" whereFound="Found in SKILLS; SUMMARY; WORK HISTORY; POS-1"></sov:Skill>
                <sov:Skill name="WEB BASED" existsInText="true" totalMonths="151" lastUsed="2017-05-02" whereFound="Found in WORK HISTORY; POS-1" childrenTotalMonths="31" childrenLastUsed="2004-12-01">
                    <sov:ChildSkill name="WEB-BASED" existsInText="true" totalMonths="31" lastUsed="2004-12-01" whereFound="Found in WORK HISTORY; POS-2; POS-3"></sov:ChildSkill>
                </sov:Skill>
            </sov:Subtaxonomy>
        </sov:Taxonomy>
        ...
    </sov:TaxonomyRoot>
</sov:SkillsTaxonomyOutput>

As shown above, this view of the skills is hierarchical and follows the Taxonomy > Subtaxonomy > Skill > Child Skill structure that the parser understands. By default, there will only be one TaxonomyRoot, "Sovren". However, if you have added custom skills as described in Customizing Skills, you may see the "Sovren" TaxonomyRoot and your custom TaxonomyRoot(s).
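
The following is a minimal sketch of walking that hierarchy in Python and collecting only the skills that actually appear in the resume text. The namespace URI bound to the sov prefix is a placeholder (an assumption); substitute the URI declared in your own parser output. The function name skills_found_in_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption); use the URI that the sov prefix
# is bound to in your actual parser output.
SOV_NS = "urn:example:sov"
NS = {"sov": SOV_NS}

SAMPLE = f"""
<sov:SkillsTaxonomyOutput xmlns:sov="{SOV_NS}">
  <sov:TaxonomyRoot name="Sovren">
    <sov:Taxonomy name="Information Technology" id="10" percentOfOverall="57">
      <sov:Subtaxonomy name="Database" id="193" percentOfOverall="33" percentOfParentTaxonomy="58">
        <sov:Skill name="MICROSOFT ACCESS" existsInText="false">
          <sov:ChildSkill name="ACCESS 97" existsInText="true" whereFound="Found in SKILLS"/>
        </sov:Skill>
        <sov:Skill name="ORACLE" existsInText="true" whereFound="Found in SKILLS"/>
      </sov:Subtaxonomy>
    </sov:Taxonomy>
  </sov:TaxonomyRoot>
</sov:SkillsTaxonomyOutput>
""".strip()


def skills_found_in_text(xml_text):
    """Return (taxonomy, subtaxonomy, skill) for each Skill/ChildSkill with existsInText="true"."""
    root = ET.fromstring(xml_text)
    found = []
    for taxonomy in root.iterfind("sov:TaxonomyRoot/sov:Taxonomy", NS):
        for sub in taxonomy.iterfind("sov:Subtaxonomy", NS):
            for skill in sub.iterfind("sov:Skill", NS):
                # A parent Skill may be listed only because one of its children was found,
                # so check the Skill and each ChildSkill separately.
                for node in [skill, *skill.iterfind("sov:ChildSkill", NS)]:
                    if node.get("existsInText") == "true":
                        found.append((taxonomy.get("name"), sub.get("name"), node.get("name")))
    return found


for row in skills_found_in_text(SAMPLE):
    print(row)
```

MICROSOFT ACCESS is skipped in this sample because it is reported only as the parent of ACCESS 97.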

The following table lists the elements and attributes associated with each of the elements above.

Element.Attribute Meaning
*.name Name of the root data list/taxonomy/subtaxonomy/skill.
{Taxonomy | Subtaxonomy}.id Unique identifier either provided by the built-in Sovren taxonomy or provided by you in a custom taxonomy.
{Taxonomy | Subtaxonomy}.percentOfOverall The weight of a specific taxonomy/subtaxonomy (and its children) divided by the total of all skill weights across all taxonomies, expressed as a percentage. The sum of all Taxonomy.percentOfOverall values, or of all Subtaxonomy.percentOfOverall values, equals 100%.
Subtaxonomy.percentOfParentTaxonomy The weight of a specific subtaxonomy (and its children) divided by the weight of its parent taxonomy, expressed as a percentage. The sum of the percentOfParentTaxonomy values for all siblings (subtaxonomies with the same parent) equals 100%.
{Skill | ChildSkill}.existsInText True if the skill/childskill was actually found in the resume text. False if we are only reporting this skill as a parent of a skill that was found.
{Skill | ChildSkill}.whereFound List of sections that the skill was found within. Starts with “Found in ” and then has a semicolon delimited list with the possible values:
  • CERTIFICATIONS
  • COVER_LETTER
  • EDUCATION
  • LANGUAGES
  • LICENSES
  • PERSONAL INTERESTS AND ACCOMPLISHMENTS
  • QUALIFICATIONS_SUMMARY
  • SKILLS
  • SUMMARY
  • WORK HISTORY
  • POS-# (corresponds to a specific PositionHistory)
  • DEG-# (corresponds to a specific Degree)
{Skill | ChildSkill}.lastUsed Most recent date that this skill was used.
{Skill | ChildSkill}.totalMonths Cumulative number of months of experience with this skill. Predominantly based on Work History and Education dates.
Skill.childrenLastUsed Most recent date that any of the skill's children were used.
Skill.childrenTotalMonths Sum of all the ChildSkill.totalMonths (accounting for overlaps) for all of this skill's children.
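
Two of the attributes above lend themselves to a short worked example. The sketch below splits a whereFound value into its individual entries, and checks that the percentages in the sample output are consistent with the percentOfParentTaxonomy definition. The function names are hypothetical, and because the published percentages are already rounded, recomputing from them is only approximate.

```python
def parse_where_found(value):
    """Split 'Found in A; B; C' into ['A', 'B', 'C']."""
    prefix = "Found in "
    if value.startswith(prefix):
        value = value[len(prefix):]
    return [part.strip() for part in value.split(";") if part.strip()]


def percent_of_parent(sub_percent_of_overall, taxonomy_percent_of_overall):
    """Approximate percentOfParentTaxonomy from two (rounded) percentOfOverall values."""
    return round(100 * sub_percent_of_overall / taxonomy_percent_of_overall)


sections = parse_where_found("Found in SKILLS; SUMMARY; WORK HISTORY; POS-2; POS-4")
positions = [s for s in sections if s.startswith("POS-")]
print(sections)
print(positions)

# Sample values: taxonomy "Information Technology" = 57, "Database" = 33, "Internet" = 10.
print(percent_of_parent(33, 57))
print(percent_of_parent(10, 57))
```

With the sample values, the two recomputed percentages come out as 58 and 18, matching the percentOfParentTaxonomy attributes shown for the Database and Internet subtaxonomies.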

Languages & Locales

The Parser includes a language and locale analyzer that is able to accurately detect all supported Parser languages and can detect and set most supported locales based on an analysis of language, phone numbers, and email addresses. It is NEVER necessary or advisable to manually override the Parser's language detection, and it is rarely advisable to override the Parser's locale detection.

So, when might it be advisable to override the default locale detection? In some cases, you may be certain that you are parsing a CV from a particular locale and you want to ensure that the Parser "knows" about that locale even if the CV does not have any information on it that would readily tell it that it is from that locale (for example, if the CV contains no contact info).

Here is an example: Australia uses a four-digit postal code, so if you are processing CVs in or from Australia, you may want to set "Culture.DefaultCountryCode = AU;" in the config string. This gives better results on the few Australian CVs that lack enough contact info for the Parser to detect that they contain Australian locale data. HOWEVER, a side effect is that, when that switch is "on" and a non-Australian CV is parsed, the Parser may erroneously report Australian contact info rather than the correct locale's contact info. For instance:

    John Smith
    Suite 404
    3017 Sydney
    Dallas, TX 75225

This is actually a USA address, but the Parser may report it as an address in postal code 3017 in Sydney, Australia, rather than as 3017 Sydney Street in Dallas, Texas, USA, postal code 75225.

Our general recommendation is that only the following locale switches are advisable to set "on", and then only when the CV is almost certain to contain that locale’s data:

  • Set "Culture.DefaultCountryCode = IN;" if you are parsing in India
  • Set "Culture.DefaultCountryCode = AU;" if parsing in Australia or New Zealand (you can use either AU or NZ) and you have Australian or New Zealand locale CVs
  • Set "Culture.DefaultCountryCode = ZA;" if you are parsing in South Africa

Again, setting these switches assumes that you really have a CV flow that is almost completely from those regions.
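
One way to apply this recommendation in client code is to emit the setting only when both conditions hold: the country is one of those listed above, and nearly all of your CVs come from it. This is a sketch only; the function name and the 95% threshold are hypothetical choices, not Sovren guidance.

```python
# Country codes listed in the recommendation above (NZ is accepted in place of AU).
RECOMMENDED_DEFAULT_COUNTRIES = {"IN", "AU", "NZ", "ZA"}


def default_country_setting(country_code, share_of_cvs_from_country, min_share=0.95):
    """Return a DefaultCountryCode clause, or '' to leave locale detection untouched.

    min_share is a hypothetical cut-off for "almost completely from that region".
    """
    code = country_code.upper()
    if code in RECOMMENDED_DEFAULT_COUNTRIES and share_of_cvs_from_country >= min_share:
        return f"Culture.DefaultCountryCode = {code};"
    return ""


print(default_country_setting("AU", 0.99))
print(repr(default_country_setting("US", 1.0)))
```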

Please note that the Parser conforms exactly to the HROpenStandards.org Resume 2.5 standard. Because of the necessity of conforming to the exact requirements of that standard, the Parser is REQUIRED to output a "CountryCode" every time it reports any location information. Unfortunately, it is not always possible to accurately determine the correct country code (Boston, UK or Boston, USA?), so at times the Parser must make an educated guess.

Personal Information

The PersonalInformation element contains a variety of information that is commonly used in some cultures and not in other cultures such as the United States. The parser will output the following data fields:

  • Ancestor (FathersName and MothersMaidenName)
  • Availability
  • Birthplace
  • DateOfBirth
  • DrivingLicense
  • FamilyComposition
  • Gender
  • Hukou (HukouCity and HukouArea)
  • Location (CurrentLocation and PreferredLocation)
  • MaritalStatus
  • MessagingAddresses
  • MotherTongue
  • NationalIdentityNumber
  • Nationality
  • Passport
  • Politics
  • Salary (CurrentSalary and RequiredSalary)
  • Visa

Some of the personal information can be inferred from other information within the resume. For example, Gender may be inferred from “Mr.” being part of the name.

Here is a sample PersonalInformation element containing every element that is supported:

<sov:PersonalInformation>
    <sov:DateOfBirth>1977-10-20</sov:DateOfBirth>
    <sov:Birthplace>Los Angeles, CA</sov:Birthplace>
    <sov:Nationality>US</sov:Nationality>
    <sov:NationalIdentities>
        <sov:NationalIdentity>
            <sov:NationalIdentityNumber>111-22-3333</sov:NationalIdentityNumber>
            <sov:NationalIdentityPhrase>Social Security Number</sov:NationalIdentityPhrase>
        </sov:NationalIdentity>
    </sov:NationalIdentities>
    <sov:Gender>Female</sov:Gender>
    <sov:MaritalStatus>Married</sov:MaritalStatus>
    <sov:DrivingLicense>CA-123123123</sov:DrivingLicense>
    <sov:CurrentLocation>Solana Beach, CA</sov:CurrentLocation>
    <sov:PreferredLocation>Boston, MA</sov:PreferredLocation>
    <sov:WillingToRelocate>Yes</sov:WillingToRelocate>
    <sov:FamilyComposition>Husband and 2 children</sov:FamilyComposition>
    <sov:FathersName>John Adams, II</sov:FathersName>
    <sov:MothersMaidenName>Angela Harris</sov:MothersMaidenName>
    <sov:Availability>Immediate, with 2 weeks notice</sov:Availability>
    <sov:VisaStatus>Green Card, expires March 2012</sov:VisaStatus>
    <sov:PassportNumber>US-456456456</sov:PassportNumber>
    <sov:CurrentSalary currency="USD">100000.00</sov:CurrentSalary>
    <sov:RequiredSalary currency="USD">110000.00</sov:RequiredSalary>
    <sov:HukouCity>湛江市</sov:HukouCity>
    <sov:HukouArea>海南</sov:HukouArea>
    <sov:PoliticalAffiliation>党员</sov:PoliticalAffiliation>
    <sov:MessagingAddress type="ICQ">john3@adams.com</sov:MessagingAddress>
    <sov:MotherTongue>en</sov:MotherTongue>
</sov:PersonalInformation>

DateOfBirth

Date of birth in yyyy-MM-dd format. If the optional inferred attribute (Boolean) is true then the DateOfBirth was inferred from an Age using the following formula: [RevisionDate] - [Age years] - [6 months]
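
As a worked example of that formula, the sketch below subtracts the age in years plus six months from a revision date. How the Parser treats day-of-month edge cases is not stated here, so clamping to the last day of the target month is an assumption of this sketch, as are the function names.

```python
import calendar
from datetime import date


def subtract_months(d, months):
    """Move a date back by a number of months, clamping the day if needed."""
    total = d.year * 12 + (d.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Assumption: clamp to the last valid day of the target month (e.g. 31 -> 28).
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def infer_date_of_birth(revision_date, age_years):
    """[RevisionDate] - [Age years] - [6 months]."""
    return subtract_months(revision_date, age_years * 12 + 6)


# A resume revised on 2020-03-15 that states "Age: 30".
print(infer_date_of_birth(date(2020, 3, 15), 30))  # 1989-09-15
```

Subtracting the extra six months places the estimate in the middle of the year in which the candidate could have been born.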

Birthplace

Freeform text that identifies the candidate’s place of birth.

Nationality

Freeform text that identifies the candidate’s country of citizenship. If the optional inferred attribute (Boolean) is true then the Nationality was inferred rather than explicitly stated.

NationalIdentities

Zero or more NationalIdentity elements.

NationalIdentityNumber

Country-specific national identity number. In order to prevent false positives, the Parser requires that the numbers be in specific formats. If numbers are not being reported, it may be due to the number being in an unsupported format. We will continue adding support for new formats, so please submit any examples to support@sovren.com.

NationalIdentityPhrase

An optional phrase associated with the NationalIdentityNumber to help identify it.

NationalIdentityType

Currently only “DNI” or “NIE” if issued by Spain.

Gender

Male or Female. If the optional inferred attribute (Boolean) is true then the Gender was inferred from the name affix, marital status, national identity number, given name, or some other means. To customize the inference by given name, customize the MALE_GIVEN_NAMES and FEMALE_GIVEN_NAMES data lists.

MaritalStatus

Married, Single, Divorced, Separated, or Unknown. If the optional inferred attribute (Boolean) is true then the MaritalStatus was inferred from the name affix, family composition, national identity number, or some other means.

DrivingLicense

Freeform text that identifies the candidate’s license to drive. May include a license number, type, qualifications, restrictions or any other explanation.

CurrentLocation

Freeform text that identifies the candidate’s current location(s), if specifically stated as such. This value is NOT derived from the contact information postal address.

PreferredLocation

Freeform text that identifies the candidate’s preferred location(s).

WillingToRelocate

One of the following values indicating the candidate’s willingness to relocate: Yes, No, or Unknown.

FamilyComposition

Freeform text that describes the candidate’s family, such as spouse and children.

FathersName

Freeform text that identifies the name of the candidate’s father.

MothersMaidenName

Freeform text that identifies the maiden name of the candidate’s mother.

Availability

Freeform text that describes when the candidate is available to work.

VisaStatus

Freeform text that describes the candidate’s current visa status, expiry date, etc.

PassportNumber

Freeform text that identifies the candidate’s passport number, expiry date, etc.

CurrentSalary

The candidate’s current salary expressed as a monetary amount. The element value is a number. The currency attribute is a 3-letter ISO 4217 currency code. This element does not specify whether the monetary amount is annual, monthly, or hourly; however, that information can usually be inferred from the value.

RequiredSalary

The salary the candidate expects for any new position, expressed as a monetary amount. The element value is a number. The currency attribute is a 3-letter ISO 4217 currency code. This element does not specify whether the monetary amount is annual, monthly, or hourly; however, that information can usually be inferred from the value.

HukouCity

Name of City for Chinese household registration (hukou record).

HukouArea

Area/Province for Chinese household registration (hukou record).

PoliticalAffiliation

Freeform text specifying the candidate’s political affiliation.

MessagingAddress

Zero or more MessagingAddress elements. The type attribute identifies the messaging system, such as ICQ, MESSENGER, QQ, etc. The element value is the candidate’s address within that messaging system.

MotherTongue

The mother tongue (also known as primary language, native language, or first language) of the candidate. The value is one of the ISO 639-1 codes. For example: Dutch (nl), English (en), French (fr), or the special value Invariant/Unknown (iv).
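
A minimal sketch of reading a few of these fields in Python follows, covering an element with the optional inferred attribute, a salary with its currency attribute (as in the sample element earlier in this section), and the repeatable MessagingAddress element. The namespace URI is a placeholder (an assumption), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Placeholder namespace URI (assumption); use the URI declared in your output.
SOV_NS = "urn:example:sov"
NS = {"sov": SOV_NS}

SAMPLE = f"""
<sov:PersonalInformation xmlns:sov="{SOV_NS}">
  <sov:Gender inferred="true">Female</sov:Gender>
  <sov:CurrentSalary currency="USD">100000.00</sov:CurrentSalary>
  <sov:MessagingAddress type="ICQ">john3@adams.com</sov:MessagingAddress>
</sov:PersonalInformation>
""".strip()


def read_personal_information(xml_text):
    root = ET.fromstring(xml_text)
    info = {}

    gender = root.find("sov:Gender", NS)
    if gender is not None:
        # Keep the inferred flag so callers can treat inferred values with more caution.
        info["gender"] = (gender.text, gender.get("inferred") == "true")

    salary = root.find("sov:CurrentSalary", NS)
    if salary is not None:
        # Decimal avoids binary floating-point rounding on monetary amounts.
        info["current_salary"] = (Decimal(salary.text), salary.get("currency"))

    info["messaging"] = [
        (m.get("type"), m.text) for m in root.findall("sov:MessagingAddress", NS)
    ]
    return info


print(read_personal_information(SAMPLE))
```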

Training

The Parser will report training elements that are found in the document. For example, this text appearing within a PositionHistory Description will also be reported in the Training element of the UserArea as shown in the box below:

    Training:
    Project Management Professional, Project Management Institute, 2004-2005
    Microsoft Visual Basic .NET, 2001
<sov:Training>
    <sov:Text>Project Management Professional, Project Management Institute, 2004-2005
Microsoft Visual Basic .NET, 2001</sov:Text>
    <sov:Item>
        <sov:Type>Unknown</sov:Type>
        <sov:TrainingName />
        <sov:Qualifications>
            <sov:Qualification>Project Management Professional</sov:Qualification>
        </sov:Qualifications>
        <sov:Entity>Project Management Institute</sov:Entity>
        <sov:Description>Project Management Professional, Project Management Institute, 2004-2005</sov:Description>
        <StartDate>
            <Year>2004</Year>
        </StartDate>
        <EndDate>
            <Year>2005</Year>
        </EndDate>
    </sov:Item>
    <sov:Item>
        <sov:Type>Unknown</sov:Type>
        <sov:TrainingName />
        <sov:Entity />
        <sov:Description>Microsoft Visual Basic .NET, 2001</sov:Description>
        <EndDate>
            <Year>2001</Year>
        </EndDate>
    </sov:Item>
</sov:Training>

Each distinct item of training is reported as an Item element within Training.

Type

Reserved for future use.

TrainingName

Reserved for future use.

Qualifications

Any text within Description that is recognized as a qualification (such as DDS), degree (such as B.S.), or a certification (such as Project Management Professional). Each qualification is listed separately.

Entity

Name of the school or company.

Description

All of the text associated with this training item.

StartDate

Start date of this training item.

EndDate

End date of this training item.
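
The sketch below reads Training items in Python. In the sample above, StartDate and EndDate carry no sov prefix while the other elements do, so this sketch matches elements by local name and ignores namespaces altogether. The namespace URI is a placeholder (an assumption), only the Year child shown in the sample is handled, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption); it is only needed to make the sample parse.
SOV_NS = "urn:example:sov"

SAMPLE = f"""
<sov:Training xmlns:sov="{SOV_NS}">
  <sov:Item>
    <sov:Qualifications>
      <sov:Qualification>Project Management Professional</sov:Qualification>
    </sov:Qualifications>
    <sov:Entity>Project Management Institute</sov:Entity>
    <StartDate><Year>2004</Year></StartDate>
    <EndDate><Year>2005</Year></EndDate>
  </sov:Item>
  <sov:Item>
    <sov:Entity />
    <sov:Description>Microsoft Visual Basic .NET, 2001</sov:Description>
    <EndDate><Year>2001</Year></EndDate>
  </sov:Item>
</sov:Training>
""".strip()


def local_name(element):
    """Strip any '{namespace}' prefix from an element's tag."""
    return element.tag.rsplit("}", 1)[-1]


def child(parent, name):
    """Return the first direct child with the given local name, or None."""
    return next((c for c in parent if local_name(c) == name), None)


def year_of(item, date_name):
    date_element = child(item, date_name)
    year = child(date_element, "Year") if date_element is not None else None
    return int(year.text) if year is not None and year.text else None


def read_training_items(xml_text):
    root = ET.fromstring(xml_text)
    items = []
    for item in root:
        if local_name(item) != "Item":
            continue  # skip sov:Text and any other non-Item children
        entity = child(item, "Entity")
        items.append({
            "entity": entity.text if entity is not None else None,
            "qualifications": [q.text for q in item.iter() if local_name(q) == "Qualification"],
            "start_year": year_of(item, "StartDate"),
            "end_year": year_of(item, "EndDate"),
        })
    return items


for entry in read_training_items(SAMPLE):
    print(entry)
```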

Patents/Publications/Speaking Engagements

When parsing of Patents, Publications, and Speaking Engagements is enabled, by setting "Coverage.PatentsPublicationsAndSpeakingEvents = True;" in the config string, these sections may be reported.

These sections are impossible to parse at a granular level with any meaningful accuracy. Do not use this data except perhaps as an indicator that the document contains such sections.

Patents

For example, this text within a resume results in the following output.

Patents
George Doam and Neil Griffin, inventors, “Method and Apparatus for Removing Corn Kernels From Dentures”, Patent 1,064,098.
<PatentHistory>
	<Patent>
		<PatentTitle>Method and Apparatus for Removing Corn Kernels From Dentures</PatentTitle>
		<Description>George Doam and Neil Griffin, inventors, "Method and Apparatus for Removing Corn Kernels From Dentures", Patent 1,064,098.</Description>
		<Inventors>
			<InventorName>George Doam and Neil Griffin</InventorName>
		</Inventors>
		<PatentDetail>
			<PatentMilestone>
				<Id>1064098</Id>
			</PatentMilestone>
		</PatentDetail>
	</Patent>
</PatentHistory>

Publications

For example, this text within a resume results in the following output.

Publications
"The Way Home:  How GPS Restored My Profits and Saved My Business Life", published in the American Journal of the Lost And Clueless, Volume 1, Number 4.
<PublicationHistory>
	<Article>
		<Title>The Way Home: How GPS Restored My Profits and Saved My Business Life</Title>
		<JournalOrSerialName>published in the American Journal of the Lost And Clueless</JournalOrSerialName>
		<Issue>Volume 1, Number 4</Issue>
	</Article>
</PublicationHistory>

Speaking Engagements

For example, this text within a resume results in the following output.

Speaking Engagements
Main Speaker, AYA Forum, 2006
<SpeakingEventsHistory>
	<SpeakingEvent>
		<EventName></EventName>
		<EventType>conference</EventType>
		<Description>Main Speaker, AYA Forum, 2006</Description>
	</SpeakingEvent>
</SpeakingEventsHistory>

Military History & Security Clearance

When parsing for Military History and Security Clearance is enabled, by setting "Coverage.MilitaryHistoryAndSecurityCredentials = True;" in the config string, these sections may be reported.

Military History

For example, this text within a resume results in the following output.

Military Service
FIRST LIEUTENANT, US Army, Vietnam theatre, 1966-1967
<MilitaryHistory>
	<CountryServed>VN</CountryServed>
	<ServiceDetail branch="US Army">
		<RankAchieved>
			<CurrentOrEndRank>FIRST LIEUTENANT</CurrentOrEndRank>
		</RankAchieved>
		<DatesOfService>
			<StartDate>
				<Year>1966</Year>
			</StartDate>
			<EndDate>
				<Year>1967</Year>
			</EndDate>
		</DatesOfService>
	</ServiceDetail>
	<Comments>FIRST LIEUTENANT, US Army, Vietnam theatre, 1966-1967</Comments>
</MilitaryHistory>

Security Clearance

For example, this text within a resume results in the following output.

CLEARANCE
Top Secret, expires 2007
<SecurityCredentials>
	<SecurityCredential>
		<Name>Top Secret, expires 2007</Name>
		<EffectiveDate>
			<FirstIssuedDate>
				<Year>2007</Year>
			</FirstIssuedDate>
		</EffectiveDate>
	</SecurityCredential>
</SecurityCredentials>

Sovren Generated Metadata

ResumeUserArea

The HROpenStandards.org Resume standard defines UserArea elements throughout the schema that can be populated with additional information that is not part of the standard. Sovren inserts significant value-added information in these UserAreas. These sections are documented in this document and defined in the SovrenResumeExtensions.xsd file.

The UserArea content for Resume elements is located at /Resume/UserArea/sov:ResumeUserArea. This is how the ResumeUserArea element is structured (with many of the details omitted to keep it short enough to review at a glance):

<sov:ResumeUserArea>
    <sov:Culture/>
    <sov:Location/>
    <sov:PersonalInformation/>
    <sov:ExperienceSummary/>
    <sov:Training/>
    <sov:Sections/>
    <sov:CustomData/>
    <sov:CoverLetterText/>
    <sov:ParsedTextLength>6103</sov:ParsedTextLength>
    <sov:SearchHints>String</sov:SearchHints>
    <sov:ParseTime>673</sov:ParseTime>
    <sov:ResumeQuality/>
    <sov:TimedOut type="hard">0</sov:TimedOut>
    <sov:ParserConfigurationString/>
    <sov:ParserVersion>6.4.6400.8</sov:ParserVersion>
</sov:ResumeUserArea>
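
As an illustration of locating this element, the sketch below follows the path /Resume/UserArea/sov:ResumeUserArea with Python's ElementTree and reads three of the simple metadata values. Both namespace URIs are placeholders, and treating Resume and UserArea as belonging to a default namespace is an assumption; check the declarations on the Resume element of your actual output.

```python
import xml.etree.ElementTree as ET

# Both URIs are placeholders (assumptions); replace them with the namespace
# URIs declared on the Resume element of your actual output.
HR_NS = "urn:example:hr-resume"
SOV_NS = "urn:example:sov"
NS = {"hr": HR_NS, "sov": SOV_NS}

SAMPLE = f"""
<Resume xmlns="{HR_NS}" xmlns:sov="{SOV_NS}">
  <UserArea>
    <sov:ResumeUserArea>
      <sov:ParsedTextLength>6103</sov:ParsedTextLength>
      <sov:ParseTime>673</sov:ParseTime>
      <sov:ParserVersion>6.4.6400.8</sov:ParserVersion>
    </sov:ResumeUserArea>
  </UserArea>
</Resume>
""".strip()


def int_or_none(text):
    return int(text) if text and text.strip() else None


def read_parse_metadata(xml_text):
    resume = ET.fromstring(xml_text)
    # Equivalent of /Resume/UserArea/sov:ResumeUserArea, relative to the root element.
    area = resume.find("hr:UserArea/sov:ResumeUserArea", NS)
    if area is None:
        return None
    return {
        "parsed_text_length": int_or_none(area.findtext("sov:ParsedTextLength", namespaces=NS)),
        "parse_time_ms": int_or_none(area.findtext("sov:ParseTime", namespaces=NS)),
        "parser_version": area.findtext("sov:ParserVersion", namespaces=NS),
    }


print(read_parse_metadata(SAMPLE))
```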

Culture

The Culture element describes the Language and Region information that is either:

  1. Calculated during parsing according to an analysis of the text, or
  2. The default that was specified, used in case the culture cannot be calculated.

This culture information influences the way the Parser works, for example how it interprets ambiguous date values such as 5/1/09, or which linguistic rules it applies when analyzing the text.
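
To see why the culture matters for a value like 5/1/09, the sketch below reads the same text under two day/month conventions. It illustrates the ambiguity only and says nothing about how the Parser is implemented; the format table and function name are hypothetical.

```python
from datetime import datetime

# Illustrative mapping only (assumption): the day/month order each culture
# conventionally uses for short numeric dates.
DATE_FORMATS = {
    "en-US": "%m/%d/%y",  # month first
    "en-GB": "%d/%m/%y",  # day first
}


def interpret_short_date(text, culture_info):
    return datetime.strptime(text, DATE_FORMATS[culture_info]).date()


print(interpret_short_date("5/1/09", "en-US"))  # 2009-05-01
print(interpret_short_date("5/1/09", "en-GB"))  # 2009-01-05
```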

A typical Culture element looks like this:

<sov:Culture>
    <sov:Language>en</sov:Language>
    <sov:Country>US</sov:Country>
    <sov:CultureInfo>en-US</sov:CultureInfo>
</sov:Culture>

Language

The primary language of the parsed text. The value is one of the ISO 639-1 codes. When the language could not be automatically determined, it is reported as the special value Invariant/Unknown (iv). The two-letter ISO codes reported by the Parser, such as “zh” for Chinese, do not differentiate between language variants, such as Mandarin and Cantonese.

The language is also reported in the top-level Resume element:

    <Resume xml:lang="en"

See the Sovren Resume Parser User Guide provided with each version of the Parser for a list of languages supported by that version. For a listing of languages and regions supported by the most recent version, you can refer to https://www.sovren.com/resume-job-parser/specs/

Country

The country of origin of the resume, typically determined by the postal address. The value is one of the 2-letter ISO 3166 codes. For example, “US” for United States.

There is one exception: in all builds prior to 8.0, United Kingdom is reported as "UK" instead of "GB" by default. To adhere to the ISO 3166 standard in these versions by using “GB” for United Kingdom, set "Culture.CountryCodeForUnitedKingdomIsUK = false;" in the config string. This setting defaults to true for backward compatibility.

CultureInfo

This is an RFC 3066 language tag that represents the actual cultural context for formatting numbers, dates, character symbols, and so on. This value is usually a simple concatenation of the Language and Country codes, such as "en-US" for US English. Beware, however, that CultureInfo can be set independently of Language and Country to achieve fine-tuned cultural control over parsing, so if you use this value, do not assume that it always matches the Language and Country.

Prefer English Version of Resume

When a document contains two versions of the resume, one in English and one in another language, the default behavior is to parse the non-English (presumably native) version. Set this property to true ("Culture.PreferEnglishVersionIfTwoLanguagesInDocument = True;") to always parse the English version, when available. This setting currently only applies to Chinese-English resumes.

Location

The Location element provides a place to store the geographic coordinates for the primary PostalAddress. This data is no longer provided by Sovren; instead, it is available through the SaaS API. If you are an “installed” customer of Sovren, you have access to the geocoding API call on your own instance of the Sovren SaaS service, using your own credentials to Bing or Google.

A typical Location element looks like this:

<sov:Location>
    <sov:Latitude inferred="true">32.9937</sov:Latitude>
    <sov:Longitude inferred="true">-117.2598</sov:Longitude>
</sov:Location>

Latitude

The latitude of the primary PostalAddress. The inferred attribute is always true for values output by the Parser. If you specify your own known value, then remove the inferred attribute or set it to false.

Longitude

The longitude of the primary PostalAddress. The inferred attribute is always true for values output by the Parser. If you specify your own known value, then remove the inferred attribute or set it to false.

Sections

One of the first things the Parser does is split the resume into sections. Each section is then handed to a sub-parser that knows how to handle the type of information in each section. The Sections element contains a collection of Section elements, each of which identifies the types and locations of sections that were found.

<sov:Section starts="25" ends="26" sectionType="SUMMARY">Summary</sov:Section>
<sov:Section starts="27" ends="28" sectionType="OBJECTIVE">OBJECTIVE:</sov:Section>
<sov:Section starts="29" ends="72" sectionType="WORK HISTORY">Experience</sov:Section>
<sov:Section starts="73" ends="79" sectionType="EDUCATION">Education</sov:Section>

starts

The first line number (zero-based) containing text of this section.

ends

The last line number (zero-based) containing text of this section.

sectionType

One of the following values: ARTICLES, AVAILABILITY, BOOKS, CERTIFICATIONS, CONFERENCE_PAPERS, CONTACT_INFO, EDUCATION, HOBBIES, IGNORE_DATA_AFTER, LANGUAGES, LICENSES, MILITARY, OBJECTIVE, OTHER_PUBLICATIONS, PATENTS, PERSONAL_INTERESTS_AND_ACCOMPLISHMENTS, PROFESSIONAL_AFFILIATIONS, QUALIFICATIONS_SUMMARY, REFERENCES, SECURITY_CLEARANCES, SKILLS, SPEAKING, SUMMARY, TRAINING, WORK_HISTORY, WORK_STATUS

Value

The value is the exact text that was used to identify the beginning of the section. If there was no text indicator and the location was calculated, then the value is “CALCULATED”.
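
The starts and ends attributes can be used to recover the text of each section. The sketch below assumes that the line numbers index into the plain-text resume split into lines, and that ends is inclusive; both are assumptions of this sketch. The namespace URI, the sample data, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption); use the URI declared in your output.
SOV_NS = "urn:example:sov"
NS = {"sov": SOV_NS}

# Hypothetical sample data for illustration.
SAMPLE = f"""
<sov:Sections xmlns:sov="{SOV_NS}">
  <sov:Section starts="0" ends="1" sectionType="CONTACT_INFO">CALCULATED</sov:Section>
  <sov:Section starts="2" ends="3" sectionType="SUMMARY">Summary</sov:Section>
</sov:Sections>
""".strip()

RESUME_LINES = ["Jane Doe", "jane@example.com", "Summary", "Experienced engineer."]


def split_into_sections(xml_text, lines):
    """Return (sectionType, header text, section lines) for every Section element."""
    root = ET.fromstring(xml_text)
    result = []
    for section in root.findall("sov:Section", NS):
        start = int(section.get("starts"))
        end = int(section.get("ends"))
        # starts/ends are zero-based; ends is treated as inclusive, hence end + 1.
        result.append((section.get("sectionType"), section.text, lines[start:end + 1]))
    return result


for row in split_into_sections(SAMPLE, RESUME_LINES):
    print(row)
```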

ReservedData

The Parser uses this section to output all of the URLs, Email Addresses, Phone Numbers, and Twitter handles found anywhere in the document. These values are not necessarily tied to the candidate.

CustomData

The parser can be customized to extract additional types of data through the use of regular expressions that you provide. For each match, a CustomDataMatch element is reported. The CustomDataMatch elements are reported in the same order they are found in the resume text.

For example, the following output represents an item that was found by setting "Data.UserDefinedParsing = True;" in the config string. This setting loads all of the built-in custom data definitions. You can define your own set by supplying it with a custom version of the CUSTOM_DATA_DEFINITIONS data list. For more information about how to provide this information via a CUSTOM_DATA_DEFINITIONS list, please read Customizing Data Lists.doc.

<sov:CustomData>
    <sov:CustomDataMatch type="SocialNetworkingUrls">
        http://www.linkedin.com/in/missmadams
    </sov:CustomDataMatch>
</sov:CustomData>

Value

The value is the text that was matched by the custom data definition.

type

The name assigned to the definition that found the text.

The built-in custom data definition types include:

  • Travel: Freeform text following any “Travel” field label.
  • WorkWeek: Text indicating employment type of full-time or part-time.
  • EmployeeType: Text of any field indicating Employee, Temporary, Contractor, Project.
  • MonsterLocations: Text following “Location” field label in the form US-CA-Sacramento.
  • SocialNetworkUrls: Links for Facebook, LinkedIn, MySpace, Plaxo, Twitter, Zing, ZoomInfo.
  • ProjectName: Text following variety of field labels indicating a Project Name.
  • ProjectEmployer: Text following variety of field labels indicating a Project Employer.
  • ProjectRole: Text following variety of field labels indicating a Project Role.
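
A minimal sketch of consuming CustomData in Python follows. It groups the matched values by their type attribute and trims the surrounding whitespace that appears in the sample above. The namespace URI is a placeholder (an assumption), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Placeholder namespace URI (assumption); use the URI declared in your output.
SOV_NS = "urn:example:sov"
NS = {"sov": SOV_NS}

SAMPLE = f"""
<sov:CustomData xmlns:sov="{SOV_NS}">
    <sov:CustomDataMatch type="SocialNetworkingUrls">
        http://www.linkedin.com/in/missmadams
    </sov:CustomDataMatch>
</sov:CustomData>
""".strip()


def group_custom_data(xml_text):
    """Map each definition type to its matched values, in document order."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for match in root.findall("sov:CustomDataMatch", NS):
        # The element value may be padded with newlines and indentation.
        grouped[match.get("type")].append((match.text or "").strip())
    return dict(grouped)


print(group_custom_data(SAMPLE))
```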

CoverLetterText

This element reports all of the text that was determined to be part of a cover letter. HROpenStandards.org Resume elements are NOT parsed from this text.

ParsedTextLength

This element reports the number of characters in the plain text resume.

ParseTime

This element reports the number of milliseconds that were spent within the Parser. This value does not include network transfer times.

ResumeQuality

The Resume Quality is a series of assessments of how well the resume conforms to Sovren's resume building best practices. Assessments are ordered by severity, from fatal problems to suggested improvements. Each assessment contains a list of findings, describing the exact issue with the resume and a recommendation to resolve the issue.

<sov:ResumeQuality>
    <sov:Assessments>
        <sov:Assessment>
            <sov:Level>Major Issues Found</sov:Level>
            <sov:Findings>            
                <sov:Information>[Sovren:311;] The resume contains a contact information section that is not the first section. A resume should always include contact info at the top of the resume.</sov:Information>
                <sov:Information>[Sovren:323;LANGUAGES,SKILLS] The following section types appear multiple times in the resume: LANGUAGES (2 occurrences), SKILLS (2 occurrences). Each section should only appear once in a resume.</sov:Information>
                <sov:Information>[Sovren:325;6] The following sections do not have a header: Section with index 6 (of type: WORK HISTORY). Every section should have a header directly above the content associated to it.</sov:Information>
            </sov:Findings>
        </sov:Assessment>
    </sov:Assessments>
</sov:ResumeQuality>

Level

The severity level of the findings for the assessment. Possible values, from most severe to least severe:

  • Fatal Problems Found
  • Major Issues Found
  • Data Missing
  • Suggested Improvements

Findings

A list of information with a code, associated identifiers, and a message describing the issue or recommendation found. Use these findings to improve the resume and conform to Sovren's Tips for Electronic Resumes.

Information

A string with 3 important pieces of information: [Sovren:{code};{identifiers}] {message}

  • Code: unique code to identify a resume quality finding information (see the chart below with all codes and identifier meanings)
  • Identifiers: identifiers for the associated data, formatted in a comma separated list: {id1},{id2},{id3} (e.g. section indexes, work history position identifiers, etc.)
  • Message: the display message to understand the issue and recommendation
Code | Description | Associated Identifiers
Fatal Problems Found (400-499)
411 | Indicates that parsing had to be stopped because the time limit was exceeded and some data may not have been processed.
412 | Indicates that no sections were found in the resume.
413 | Indicates that a WORK HISTORY section was not found.
414 | Indicates that an EDUCATION section was not found.
415 | Indicates that WORK HISTORY information was found but had to be calculated as a section.
416 | Indicates that EDUCATION information was found but had to be calculated as a section.
417 | Indicates that this document is likely a curriculum vitae and prone to errors due to the use of nonstandard headers and the vast amount of data describing patents, speaking engagements, research, advisory roles, publications, etc. Accordingly, only the first WORK HISTORY section was parsed, as that usually results in far greater accuracy.
Major Issues Found (300-399)
301 | Indicates that neither a contact phone number nor an email address could be found. A resume should include at least one phone number or one email address.
302 | Indicates that the first and last name for the candidate was not found.
303 | Indicates that sections were found that appear to be longer than the WORK HISTORY and EDUCATION sections combined. This usually indicates that the sections were not identified correctly and that most of the content ended up in the wrong section. | Section Indexes
311 | Indicates if a contact information section was found somewhere other than the top of the resume. Contact information should only be found at the top of the resume.
312 | Indicates if a publications section with a significant amount of content appears in the resume. Publications should be avoided in a resume; if they are necessary, they should be no longer than 10 lines. | Section Indexes
323 | Indicates if multiple sections of the same type have been found in the resume. | Section Types
324 | Indicates if any sections with no text, other than the header, have been found in the resume. | Section Indexes
325 | Indicates if any sections with no header have been found in the resume. | Section Indexes
331 | Indicates that the number of jobs found in the resume exceeds the threshold of 30 jobs.
Data Missing (200-299)
211 | Indicates if no email address was found to contact the candidate.
212 | Indicates if no phone number was found to contact the candidate.
213 | Indicates if no street level address was found for the candidate.
221 | Indicates if any jobs were found without job titles. | UserArea.Id values of the work history positions
222 | Indicates if any jobs were found without company names. | UserArea.Id values of the work history positions
223 | Indicates if more than one current job was found with the same employer. | UserArea.Id values of the work history positions
224 | Indicates if any jobs were found without start dates. | UserArea.Id values of the work history positions
225 | Indicates if any jobs were found without end dates. | UserArea.Id values of the work history positions
226 | Indicates if no jobs were found within the past year of the revision date.
231 | Indicates if any educational degrees were found without degree names. | UserArea.Id values of the degree
232 | Indicates if any educational degrees were found without school names. | UserArea.Id values of the degree
Suggested Improvements (100-199)
111 | Indicates if a references section was found. A resume does not need to include a references section.
112 | Indicates if a separate skills section was found. Skills should be included in the context of work history and education descriptions.
113 | Indicates if a publications section without a significant amount of content appears in the resume. Including a publications-type section in a resume should always be avoided. | Section Indexes
121 | Indicates if a driving license number was found in the resume. Do not include this level of personal information in a resume. (Only applies to US resumes)
122 | Indicates if a passport number was found in the resume. Do not include this level of personal information in a resume. (Only applies to US resumes)
123 | Indicates if the candidate's marital status was found in the resume. Do not include this level of personal information in a resume. (Only applies to US resumes)
124 | Indicates if the candidate's date of birth was found in the resume. Do not include this level of personal information in a resume. (Only applies to US resumes)
131 | Indicates if multiple addresses were found in the contact information section. Only one contact address should be included in a resume.
132 | Indicates if multiple email addresses were found in the contact information section. Only one contact email address should be included in a resume.
133 | Indicates if multiple phone numbers were found in the contact information section. Only one contact phone number should be included in a resume.
141 | Indicates if any jobs or companies with a street level address were found. Never include a street level address for a job or company in a resume. | UserArea.Id values of the work history positions
142 | Indicates if any schools with a street level address were found. Never include a street level address for a school in a resume. | UserArea.Id values of the degree
151 | Indicates if any sections were found with the header not on a separate line above the content for that section. | Section Indexes
161 | Indicates the resume contains high school education as well as higher-level education. | UserArea.Id values of the high school degrees found
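
As a rough illustration of how a client might bucket these codes, the sketch below maps a numeric code to its category using the ranges in the headings above, and splits an Identifiers value into a list. The function names are hypothetical, and the sketch assumes the braces in {id1},{id2},{id3} are placeholders rather than literal characters.

```python
def classify_quality_code(code):
    """Return the category name for a code, based on the documented ranges."""
    if 400 <= code <= 499:
        return "Fatal Problems Found"
    if 300 <= code <= 399:
        return "Major Issues Found"
    if 200 <= code <= 299:
        return "Data Missing"
    if 100 <= code <= 199:
        return "Suggested Improvements"
    raise ValueError(f"unrecognized quality code: {code}")


def split_identifiers(identifiers):
    """Split a comma-separated Identifiers value, ignoring empty entries."""
    return [part.strip() for part in identifiers.split(",") if part.strip()]


print(classify_quality_code(413))    # Fatal Problems Found
print(split_identifiers("3,7,12"))   # ['3', '7', '12']
```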

TimedOut

If the Parser timed out while processing the resume, then a TimedOut element is reported. The value is the number of milliseconds spent parsing before the timeout was reached.

The type attribute has one of these values:

  • soft: The 15 second timeout was reached. The parser stopped at the next checkpoint and returned all information that had been processed up to that moment.
  • hard: The 22.5 second timeout was reached. The parser stopped immediately (between checkpoints) and returned all information that had been processed up to that moment.

For example, the following represents a soft timeout that occurred after 15.121 seconds:

    <sov:TimedOut type="soft">15121</sov:TimedOut>
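
A minimal sketch of reading this element with Python's standard library follows. The namespace URI bound to the sov prefix here is a placeholder (an assumption); substitute the URI declared in your actual parser output.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption); use the one declared in your output.
SOV_NS = "urn:example:sovren"

fragment = f'<sov:TimedOut xmlns:sov="{SOV_NS}" type="soft">15121</sov:TimedOut>'


def read_timeout(xml_text):
    """Return (timeout type, seconds spent parsing) from a TimedOut element."""
    element = ET.fromstring(xml_text)
    return element.get("type"), int(element.text) / 1000.0


timeout_type, seconds = read_timeout(fragment)
print(timeout_type, seconds)  # soft 15.121
```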

ParserConfigurationString

This element reports the Parser configuration that was used during parsing. The configuration is output as a string of Name=Value pairs, each representing a parser setting.

This string value is not necessarily the same as the string that was passed in before parsing. It is the final combination of the value you provided plus pre-configured or built-in defaults and settings changed by the Parser at runtime as a consequence of its locale and language detection.
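
If you want to inspect individual settings, the string can be split into a dictionary. The sketch below is an assumption-heavy illustration: this guide does not state the separator between pairs, so a semicolon is assumed and exposed as a parameter, and the setting names in the example are made up.

```python
def parse_configuration(config, separator=";"):
    """Split a string of Name=Value pairs into a dict.

    The separator is an assumption; pass the one your output actually uses.
    """
    settings = {}
    for pair in config.split(separator):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


# Hypothetical setting names, for illustration only.
print(parse_configuration("SettingA=1; SettingB=true"))
```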

ParserVersion

This element simply reports the version number of the Parser that produced the output.

Experience Summary

The parser performs many calculations to summarize the experience of the candidate. All of those calculations are reported within the ExperienceSummary element, which looks like this:

<sov:ExperienceSummary>
    <sov:Description>Molly A. Adams's experience appears to be concentrated in Information Technology (Database), with ...</sov:Description>
    <sov:CareerStory />
    <sov:MonthsOfWorkExperience>204</sov:MonthsOfWorkExperience>
    <sov:MonthsOfManagementExperience>151</sov:MonthsOfManagementExperience>
    <sov:CurrentManagementLevel>mid-level</sov:CurrentManagementLevel>
    <sov:HighestManagementScore>55</sov:HighestManagementScore>
    <sov:ExecutiveType>business_dev</sov:ExecutiveType>
    <sov:AverageMonthsPerEmployer>68</sov:AverageMonthsPerEmployer>
    <sov:FulltimeDirectHirePredictiveIndex>9</sov:FulltimeDirectHirePredictiveIndex>
    <sov:ManagementStory>Current position is a mid-level management role: ...</sov:ManagementStory>
    <sov:AttentionNeeded>ATTENTION: The candidate appears to have been in management in a past ...</sov:AttentionNeeded>
    <sov:SkillsTaxonomyOutput />
</sov:ExperienceSummary>

Description

The Description element contains a paragraph of text that summarizes the candidate’s experience. This paragraph is generated based on the other data points within the ExperienceSummary. The paragraph is generated in the same language as the resume for Dutch, English, French, Spanish, and Swedish (but not yet for German or Greek).

CareerStory

The CareerStory element contains a paragraph of text that summarizes the candidate’s entire career.

MonthsOfWorkExperience

The number of months of work experience as indicated by the range of StartDate and EndDate values in the various PositionHistory elements. Overlapping date ranges are not double-counted. This value is NOT derived from text like “I have 15 years of experience”.
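
To make "not double-counted" concrete, here is a small sketch that counts the distinct calendar months covered by a set of date ranges. It is only an illustration of the idea, not the Parser's actual calculation; in particular, counting both the start and end month inclusively is an assumption.

```python
from datetime import date


def month_index(d):
    """Number the months consecutively so ranges can be compared."""
    return d.year * 12 + (d.month - 1)


def months_covered(ranges):
    """Count distinct calendar months covered by (start, end) date pairs."""
    covered = set()
    for start, end in ranges:
        covered.update(range(month_index(start), month_index(end) + 1))
    return len(covered)


positions = [
    (date(2015, 1, 1), date(2016, 12, 31)),  # 24 months
    (date(2016, 7, 1), date(2017, 6, 30)),   # 12 months, 6 of them overlapping
]
print(months_covered(positions))  # 30
```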

MonthsOfManagementExperience

The number of months of management experience as indicated by the range of StartDate and EndDate values in the various PositionHistory elements that have been determined to be management-level positions. Overlapping date ranges are not double-counted. This value is NOT derived from text like “I have 10 years of management experience”.

CurrentManagementLevel

Computed level of management for the current position. One of the following values:

  • low-or-no-level
  • low-level
  • mid-level
  • somewhat high-level
  • high-level
  • executive-level

HighestManagementScore

The highest score calculated from any of the position titles. The score is based on the wording of the title, not on the experience described within the position description.

  • 0-29 = Low level
  • 30-59 = Mid level
  • 60+ = High level
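
The bands above translate directly into a lookup. The function name below is hypothetical.

```python
def management_band(score):
    """Map a HighestManagementScore to the band listed above."""
    if score < 0:
        raise ValueError("score must not be negative")
    if score <= 29:
        return "Low level"
    if score <= 59:
        return "Mid level"
    return "High level"


print(management_band(55))  # Mid level
```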

ExecutiveType

If HighestManagementScore is at least 30, then the job titles are examined to determine the best category for the executive experience, from among the following:

  • none
  • admin
  • accounting
  • business_dev
  • executive
  • financial
  • general
  • it
  • learning
  • marketing
  • operations

AverageMonthsPerEmployer

The average number of months a candidate has spent at each employer. Note that this number is per employer, not per job.

FulltimeDirectHirePredictiveIndex

A score (0-100), where 0 means a candidate is more likely to have had (and want/pursue) short-term/part-time/temp/contracting jobs and 100 means a candidate is more likely to have had (and want/pursue) traditional full-time, direct-hire jobs.

ManagementStory

The ManagementStory is a plain text line-by-line summary of the management experience.

AttentionNeeded

A message containing information about something abnormal about the candidate (e.g. the candidate was in management at one point but not at their most recent position). This element does not appear in the results if nothing abnormal was found.

Customizing Skills

Skills customization is the most popular and flexible feature of our parser. You can add and remove skills in our existing list, create your own skills lists, and even maintain multiple lists for multi-tenant implementations. Combining your custom skills taxonomy with Sovren's skills taxonomy can provide superior parsing results. We highly recommend evaluating this feature and considering the valuable impact it can have on your parsing.

This guide describes the various methods that can be used to customize the set of skills and taxonomies that are used by Sovren parsers. These customizations are for use only with and by the Sovren software and for no other use.

Terminology

Skills (known as Competency elements in HROpenStandards.org’s Resume schema, which we support) are the keywords and phrases that appear within documents to represent a tool, practice, something known, something used, etc. The parser supports an unlimited number of skills, and these skills can be arranged into a hierarchy of parent skills and child skills. As an example, the skill "Microsoft SQL Server" may have children of "MSSQL", "SQL 2000", "SQL Server 2000", and so on, so that if one of those child terms is found then you know that you have found someone with "Microsoft SQL Server" experience. The children of a skill can often be thought of as its synonyms.

Taxonomies are a categorization scheme for skills. The parser supports an unlimited number of taxonomies arranged into a hierarchy of parent taxonomies and sub taxonomies. Every skill must be traceable to a sub taxonomy, so there must be at least one parent taxonomy and one sub taxonomy. Examples of taxonomies are "Information Technology" and "Human Resources", and there are child taxonomies such as "Database", "Project Management", and "Programming". The taxonomies can alternatively be organized by industry, line of business, or any other form of categorization you wish. You can even choose to use different categorizations for the same set of shared skills.

The primary difference between skills and taxonomies is that the skill terms are parsed out of the document, while the taxonomy names are not. The taxonomy names are only used when reporting calculated "best fit" taxonomies and executive summaries.

Note: "Taxonomies" and "Skills", as described above, are kept in two separate files. We refer to these files as "SKILLS_TAXONOMIES.[xx].txt" and "SKILLS.[xx].txt".

Common Mistake

Customers often tell us: "Hey, I don't need the whole built-in taxonomy. We are just in [Healthcare, IT, retail, whatever] and we only care about those skills, so we are only going to parse with those skills."

HUGE MISTAKE. Totally logical, and totally wrong.

Why?

Simple. The purpose of the skills taxonomy and skills parsing is TWOFOLD:

  1. Determine what skills a person has used, and when, and for how long.
  2. Using the above data, determine "who and what" the person is: are they a Programmer who has worked on Nursing applications, or are they a Nurse who has used certain medical software?

And #2 is why using only a subset of the built-in taxonomy is a huge mistake: because if the only skills that you parse for are nursing skills, then it follows that 100% of all resumes parsed will calculate that the candidate is a nurse. Which is false. An attorney, or a retail clerk, or a programmer, is not a nurse. But if the only skills in your taxonomy are nursing skills, then EVERYONE will be classified as a nurse, including every plumber, every retail clerk, every attorney, every programmer -- EVERYONE!!!

So, DON'T DO IT! Use the whole taxonomy. And supplement the Sovren built-in skills by adding in additional skills as you need them, rearranging the skills as desired, and even deleting some skills you don't like.
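
The effect is easy to reproduce with a toy model. The sketch below is NOT the Parser's algorithm; it simply assumes that "best fit" means the taxonomy with the most matching skill terms, and the taxonomy contents and resume text are invented for illustration.

```python
def best_fit_taxonomy(resume_text, taxonomies):
    """Toy model: pick the taxonomy whose skills match the text most often."""
    text = resume_text.lower()
    hits = {
        name: sum(1 for skill in skills if skill.lower() in text)
        for name, skills in taxonomies.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


FULL = {
    "Programming": ["python", "sql", "unit testing"],
    "Nursing": ["patient care", "triage"],
}
NURSING_ONLY = {"Nursing": FULL["Nursing"]}

resume = "Wrote Python and SQL services for a patient care scheduling system."
print(best_fit_taxonomy(resume, FULL))          # Programming
print(best_fit_taxonomy(resume, NURSING_ONLY))  # Nursing
```

With the full taxonomy the programmer is classified as a programmer; with the nursing-only taxonomy the same resume can only ever come out as nursing.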

Built-In Skills

The parser currently includes built-in language-specific skills for every supported language.

Skills parsing can be configured to use:

  • (Default) The entire built-in language-specific list from above
  • A subset of the built-in taxonomies
  • An entire custom list or multiple custom lists
  • A subset of the custom lists
  • A combination of the built-in list and custom lists, or subsets thereof

You can choose which set of skills to use per transaction, meaning for example that you could simultaneously parse one resume using the skills desired by your Customer A and another resume using the skills desired by your Customer B, and so on. Or you could choose to parse all resumes for all skills, and then filter the skills afterwards, since the parser returns the source (built-in, custom list 1, 2, etc.) and taxonomy for each skill it finds.

Custom Skills

You can create as many different sets of custom taxonomy/skills files as you wish, such as one or more sets per customer in multi-tenant architectures. At runtime, per parsing transaction, you tell the Parser which taxonomy/skills list (or combination of lists) to use. The following sections describe the formats for these files in detail, and how to use them at runtime.

NOTE: For parent-child relationships, the parent record must be listed before its child records.

Think of child skills (those that have a ParentId) as being specializations or synonyms of the parent skill. You might include many different abbreviations and spellings of the parent. For example, if the parent skill is VISUAL C++ you could define child skills such as VC++, MSVC++, MS VISUAL C++, VCPP and so on.
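
On the consuming side, this relationship lets you roll any child term up to its parent. A minimal sketch using the VISUAL C++ example follows; the dictionary and function name are hypothetical and stand in for whatever structure you build from your own skills list.

```python
# Child term -> parent skill, taken from the example above.
PARENT_OF = {
    "VC++": "VISUAL C++",
    "MSVC++": "VISUAL C++",
    "MS VISUAL C++": "VISUAL C++",
    "VCPP": "VISUAL C++",
}


def canonical_skill(term):
    """Return the parent skill for a child term, or the term itself."""
    key = term.upper()
    return PARENT_OF.get(key, key)


print(canonical_skill("vcpp"))        # VISUAL C++
print(canonical_skill("Visual C++"))  # VISUAL C++
```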

SDF (System Data File), a Fixed-Column-Width Format

SDF files are a defined fixed-column-width format. They are very simple to create and use. The SDF format is probably the oldest file format for databases and data exchange. The SDF format has these advantages that no other format offers:

  • SDF files can be edited by ANY text editor (but use a Unicode-aware editor and save with UTF-8 encoding!).
  • SDF fields can contain ANY non-control characters, without worrying about escaping them.
  • SDF files can be easily read/scanned/searched by human beings since all the data fields line up.
  • SDF files cannot be accidentally corrupted by importing/exporting or sorting them to and from a spreadsheet since they are edited in a text editor and not a spreadsheet.

A Taxonomy SDF file contains a list of taxonomies, or categories, that skills are grouped into.

Here is the layout of the "Taxonomies" file in SDF format:

Columns Field Name Field Description
1-30 ParentTaxonomyId The TaxonomyId for the parent taxonomy, if any; otherwise blank.
31-60 TaxonomyId The unique id for this taxonomy record. Required.
61-120 TaxonomyName The name of this taxonomy.
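
If you generate this file programmatically, each record is the three fields padded to their column widths. The sketch below assumes space padding out to column 120; the ids "IT" and "IT-DB" are hypothetical.

```python
# (field name, width) in column order: 1-30, 31-60, 61-120.
TAXONOMY_FIELDS = [("ParentTaxonomyId", 30), ("TaxonomyId", 30), ("TaxonomyName", 60)]


def format_taxonomy_record(parent_id, taxonomy_id, name):
    """Build one fixed-width line for a Taxonomies SDF file."""
    if not taxonomy_id:
        raise ValueError("TaxonomyId is required")
    parts = []
    for (field, width), value in zip(TAXONOMY_FIELDS, (parent_id, taxonomy_id, name)):
        if len(value) > width:
            raise ValueError(f"{field} is longer than {width} characters")
        parts.append(value.ljust(width))
    return "".join(parts)


parent_line = format_taxonomy_record("", "IT", "Information Technology")
child_line = format_taxonomy_record("IT", "IT-DB", "Database")
print(len(child_line))  # 120
```

Write the parent record before its children, and save the file as UTF-8.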

A "Skills" SDF file contains a list of skills that are assigned to taxonomies specified in a Taxonomy SDF file.

Here is the layout of a "Skills" file in SDF format:

Columns Field Name Field Description
1-60 SkillName The term or phrase that represents this skill in resumes. This can be a single word or multiple-words. It can contain any combination of letters, numbers, punctuation, symbols and whitespace. For example: VISUAL C++ 6.0. Capitalization is unimportant. If you need more than 60 characters for a skill, you do not have a skill: you have a skill definition or description, which is not useful for parsing.
61-90 ParentSkillId The SkillId of the parent skill, if this is a child skill. Either the ParentSkillId or TaxonomyId must be specified, but not both.
91-120 SkillId The unique id of this skill. It can contain letters, numbers, symbols and even internal whitespace.
121-150 Unused This field is currently unused and is ignored.
151-180 TaxonomyId The id of the Taxonomy that this skill belongs within (and this means that the skill is NOT a child skill of another skill). Either the ParentSkillId or TaxonomyId must be specified, but not both.
181-190 AllowLowerCaseMatch Set to "true" (without quotes) to match the SkillName in a document even if it appears in all lower case. Any other value, including blank, is treated as "false".
191-200 Output Set to "false" (without quotes) to suppress output of this skill in the parsed output. Any other value, including blank, is treated as "true".
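
Reading a Skills record back is a matter of slicing at the column boundaries. The sketch below is an illustration only; the skill and taxonomy ids in the example are hypothetical.

```python
# Zero-based slices for the 1-based columns in the table above.
SKILL_COLUMNS = {
    "SkillName": (0, 60),
    "ParentSkillId": (60, 90),
    "SkillId": (90, 120),
    "Unused": (120, 150),
    "TaxonomyId": (150, 180),
    "AllowLowerCaseMatch": (180, 190),
    "Output": (190, 200),
}


def parse_skill_record(line):
    """Slice one Skills SDF line into its fields and check the parent rule."""
    padded = line.rstrip("\r\n").ljust(200)
    record = {name: padded[a:b].strip() for name, (a, b) in SKILL_COLUMNS.items()}
    if bool(record["ParentSkillId"]) == bool(record["TaxonomyId"]):
        raise ValueError("specify either ParentSkillId or TaxonomyId, but not both")
    return record


line = (
    "VISUAL C++".ljust(60) + "".ljust(30) + "VCPP-1".ljust(30)
    + "".ljust(30) + "IT-PROG".ljust(30) + "true".ljust(10)
)
record = parse_skill_record(line)
print(record["SkillName"], record["TaxonomyId"])  # VISUAL C++ IT-PROG
```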

Create Custom Skills Lists

The Parser only supports taxonomies that are constructed with TWO LEVELS of taxonomy and ONE OR TWO LEVELS of skills. In other words, you must have at least this structure:

ROOT
    PARENT Taxonomy
        CHILD Taxonomy
            SKILL
                OPTIONAL CHILD Skill(s)

You may NOT have skills that are tied directly to a Parent taxonomy. All skills must be tied to a Subtaxonomy or to a parent skill that is tied to a Subtaxonomy. If you try to load any lists that do not correspond to this format, you will receive validation errors.
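
Before loading generated lists, it can help to check these rules yourself. The sketch below encodes our reading of the rules in this guide (parent records first, two taxonomy levels, skills tied to a subtaxonomy or to a parent skill); it is not the validation the Skills Editor performs, and the ids are hypothetical.

```python
def validate_lists(taxonomies, skills):
    """taxonomies: (parent_taxonomy_id, taxonomy_id) pairs in file order.
    skills: (skill_id, parent_skill_id, taxonomy_id) triples in file order.
    Returns a list of error messages; empty means no problems were found."""
    errors = []
    parent_of = {}  # taxonomy id -> parent id ("" for a parent taxonomy)
    for parent_id, taxonomy_id in taxonomies:
        if parent_id and parent_id not in parent_of:
            errors.append(f"taxonomy {taxonomy_id}: parent {parent_id} is not listed before it")
        elif parent_id and parent_of[parent_id]:
            errors.append(f"taxonomy {taxonomy_id}: more than two taxonomy levels")
        parent_of[taxonomy_id] = parent_id

    top_level_skills = set()
    for skill_id, parent_skill_id, taxonomy_id in skills:
        if bool(parent_skill_id) == bool(taxonomy_id):
            errors.append(f"skill {skill_id}: set exactly one of ParentSkillId or TaxonomyId")
        elif taxonomy_id:
            if not parent_of.get(taxonomy_id):
                errors.append(f"skill {skill_id}: {taxonomy_id} is not a known subtaxonomy")
            else:
                top_level_skills.add(skill_id)
        elif parent_skill_id not in top_level_skills:
            errors.append(f"skill {skill_id}: parent skill {parent_skill_id} must be listed earlier and tied to a subtaxonomy")
    return errors


taxonomies = [("", "AVIATION"), ("AVIATION", "COMMERCIAL")]
good = [("AERO", "", "COMMERCIAL"), ("AERO-SYN", "AERO", "")]
bad = [("AERO", "", "AVIATION")]  # tied directly to a parent taxonomy
print(validate_lists(taxonomies, good))       # []
print(len(validate_lists(taxonomies, bad)))   # 1
```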

Creating custom skills lists using the Skills Editor app

We have an app for this! The Sovren Skills Editor app! That app will allow you to make changes to the Sovren built-in skills lists for use by Sovren products, and create, import, export, and edit custom lists as well. The custom lists you import should be created in the SDF formats defined above.

The built-in skills are available in an encrypted binary format, in files with an .SDF extension. These encrypted files can be opened, edited, and saved only with the Sovren Skills Editor. They can be used anywhere a normal text SDF file or stream can be used; however, you cannot view, edit, or use them outside of the Sovren software.

The binary format of the encrypted SDF files is undocumented. You may not in any way attempt to recreate the Sovren skills lists during your use of the Sovren Skills Editor, nor by any other method. Any attempts to re-type, copy and paste, create screenshots, etc., are strictly prohibited.

As a customer, you should have a login to the Sovren Portal where you have been granted access to the Skills Editor app.

ALWAYS use the Sovren Skills Editor to update skills lists, even if you create them manually or by some custom process, because the Sovren Skills Editor contains built-in checks to ensure that the files are valid and are named correctly. For SaaS customers, the app even includes a way to transmit the files to the Sovren SaaS system for storage and use.

Creating custom skills lists manually or programmatically

If for some reason you need to create your lists manually or programmatically outside of the Skills Editor app, you can do so with any program that can output a flat text file that conforms to our SDF format (see above section for more info). For example, you may have lists that already exist in a database table. You can then export that data through a program or process to generate your SDF formatted files.

Once you have generated your SDF files (one for your taxonomies and one for your skills), we still strongly recommend that you load them through the Skills Editor app to make sure they are valid. However, if you are a SaaS customer and want to post your lists programmatically to your account, you can do so by using the SetData method (REST | SOAP).

Skills Editor App Walkthrough

Using your own custom skills lists through the SaaS API is simple and straightforward. In this example, we will create and test a custom skills list:

  1. Download the Sovren Skills Editor package available through My Sovren Portal under the Tools section in the Developer Center.
  2. Unzip the Sovren.SkillsEditor.zip file to a local directory.
  3. Open the Sovren.SkillsEditor.exe app.
  4. Select the Aviation taxonomy, right-click and select New Taxonomy. Then type in "Commercial":



  5. Add the following skills under the Commercial taxonomy (PERFORMANCE CHARTS, BALANCE COMPUTATIONS, WEIGHT COMPUTATIONS, AERODYNAMICS, ALTITUDE OPERATIONS). You must always have at least 2 taxonomy levels and skills can't be assigned directly to a parent taxonomy (they must be assigned to a subtaxonomy). In this example, Aviation is the parent taxonomy and Commercial is the subtaxonomy so we are assigning skills that are specific to that taxonomy.



  6. Now, click on File -> SaaS Skills Data... and type in your Account ID and Service Key in the form:



  7. Click Save As..., enter "myskills" in the Name field and "English (en)" in the Culture field, and click Save. You should receive a message that the skills were successfully posted; you can now use this custom skills list when parsing.

Customizing Normalization

The Sovren Normalizer is offered as a SaaS API that uses user-editable files to control the output. The normalizer includes built-in normalization lists but you may also implement your own custom lists.

Normalizer Files

The Sovren Normalizer uses standard tab delimited text files to normalize terms.

  • We strongly recommend using the Sovren Normalizer Editor application (available in the Sovren Portal site) to create and maintain custom normalizer files. The application will take care of applying all the necessary settings, formats, and posting of your files to your SaaS account (if applicable) through a simple GUI.
  • These files are hand editable. If you deploy the Normalizer in multiple places, these files will be out of sync unless you copy the data files into every Data folder under every Normalizer location or configure all Normalizer instances to use a common Data folder.
  • You must edit and save the files in a Unicode-aware application and manner. We suggest Notepad. When saving the files in Notepad, you must choose File => Save As => and then change the encoding to UTF-8 before saving.
  • Back up the files before saving. Do NOT modify the file names, nor leave off the file extensions, nor rename the Data folder.
  • If you are using the installed API and make changes to the files while the application containing the Normalizer class is still running, you will either need to restart the application or call Normalizer.ClearDataCache for the changes to take effect.

SN.COMPANY.ALL.txt

Contains the normalized company name and then the raw company name, separated by a single tab and only by a single tab. For example, the following company names will all get normalized to “Amazon”:

Amazon	Amazon.co.uk
Amazon	Amazon.com
Amazon	Amazon.com Inc.
Amazon	Amazon.com, Inc
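
If you need to inspect or sanity-check such a file outside the Normalizer Editor, the two-column layout is simple to load. The sketch below is an illustration only; it assumes exact, case-sensitive matching on the raw name, which may differ from how the Normalizer itself matches.

```python
def load_normalizer_lines(lines):
    """Build a raw-name -> normalized-name dict from tab-delimited lines."""
    mapping = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected exactly one tab")
        normalized, raw = parts
        mapping[raw] = normalized
    return mapping


company_lines = [
    "Amazon\tAmazon.co.uk",
    "Amazon\tAmazon.com",
    "Amazon\tAmazon.com Inc.",
    "Amazon\tAmazon.com, Inc",
]
companies = load_normalizer_lines(company_lines)
print(companies["Amazon.com Inc."])  # Amazon
```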

SN.COMPANY_DQ_ANYWHERE.ALL.txt

Contains the normalized company name and then the disqualifier word or phrase, separated by a single tab and only by a single tab. If the disqualifier word or phrase appears ANYWHERE in the resume, it will prevent the company name from being normalized to the value shown. For example, the following company names will not get normalized to “Amazon Web Services” if the term to the right appears anywhere in the resume:

Amazon Web Services	Allied Waste Services
Amazon Web Services	Allied Waste
Amazon Web Services	Waste Services 

SN.COMPANY_DQ_IN_NAME.ALL.txt

Contains the normalized company name and then the disqualifier word or phrase, separated by a single tab and only by a single tab. If the disqualifier word or phrase appears ANYWHERE in the company name, it will prevent the company name from being normalized to the value shown. For example, the following terms will not get normalized to “Amazon” if the term to the right appears anywhere in the company name:

Amazon	Amazon Floral Services
Amazon	Amazon Landscaping 
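
The interaction between the company list and the two disqualifier lists can be sketched as follows. This is our reading of the descriptions above, not the Normalizer's implementation: substring matching is assumed, a disqualified name is assumed to be left unchanged, and the raw name "AWS" is a hypothetical entry.

```python
def normalize_company(raw_name, resume_text, names, dq_anywhere, dq_in_name):
    """names: raw name -> normalized name.
    dq_anywhere / dq_in_name: normalized name -> list of disqualifier phrases."""
    normalized = names.get(raw_name)
    if normalized is None:
        return raw_name
    # A disqualifier inside the company name itself blocks normalization.
    if any(phrase in raw_name for phrase in dq_in_name.get(normalized, [])):
        return raw_name
    # A disqualifier anywhere in the resume text also blocks normalization.
    if any(phrase in resume_text for phrase in dq_anywhere.get(normalized, [])):
        return raw_name
    return normalized


names = {"AWS": "Amazon Web Services"}  # hypothetical raw name
dq_anywhere = {"Amazon Web Services": ["Allied Waste Services", "Allied Waste", "Waste Services"]}
dq_in_name = {}

print(normalize_company("AWS", "Cloud engineer at AWS.", names, dq_anywhere, dq_in_name))
# Amazon Web Services
print(normalize_company("AWS", "Driver at Allied Waste Services (AWS).", names, dq_anywhere, dq_in_name))
# AWS
```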

SN.POSITION_TITLE_SYNONYMS.ALL.txt

Contains the normalized job title word or phrase and then the job title synonym, separated by a single tab and only by a single tab. For example, the following titles will get normalized to “VP”:

VP	Vice Pres
VP	Vice Pres.
VP	Vice President

SN.SCHOOL.ALL.txt

Contains the normalized school name and then the raw school name, separated by a single tab and only by a single tab. For example, the following school names will get normalized to “California Institute of Technology”:

California Institute of Technology	California Institute of Technology Pasadena
California Institute of Technology	California Institute of Technology-United States-CA
California Institute of Technology	Caltech
California Institute of Technology	Southern California Institute of Technology

SN.SCHOOL_DQ_ANYWHERE.ALL.txt

Contains the normalized school name and then the disqualifier word or phrase, separated by a single tab and only by a single tab. If the disqualifier word or phrase appears ANYWHERE in the resume, it will prevent the school name from being normalized to the value shown. For example, the following school names will not get normalized to “California Institute of Technology” if the term to the right appears anywhere in the resume:

California Institute of Technology	California Technology School
California Institute of Technology	California School of Technology

SN.SCHOOL_DQ_IN_NAME.ALL.txt

Contains the normalized school name and then the disqualifier word or phrase, separated by a single tab and only by a single tab. If the disqualifier word or phrase appears ANYWHERE in the school name, it will prevent the school name from being normalized to the value shown. For example, the following school names will not get normalized to “Harvard University” if the term to the right appears anywhere in the school name:

Harvard University	Harvard Technical College
Harvard University	Harvard Community College

SN.SCHOOL_SYNONYMS.ALL.txt

Contains the normalized school word or phrase and then the school synonym, separated by a single tab and only by a single tab. These synonyms are applied to a raw school name to produce one of the standardized variations of the school name used to find a matching entry in SN.SCHOOL.ALL.txt. If a match is not found in the SCHOOL list, then these synonym substitutions are applied to the dynamic normalized value. For example, the following terms will all get normalized to “University”:

University	U
University	U.
University	UNIV
University	UNIV.

SN.DEGREE.ALL.txt

Contains the normalized degree name and then the raw degree name, separated by a single tab and only by a single tab. For example, the following degree names will get normalized to “Bachelors”:

Bachelors	B. A
Bachelors	B. A.
Bachelors	Bachelors of Arts

SN.STRIP_FROM_COMPANY_NAME.ALL.txt

Contains words which will be stripped from the company names. For example, the following terms will be removed from company names if found:

& CO
& CO.
& COMPANY

SN.STRIP_FROM_SCHOOL_NAME.ALL.txt

Contains words which will be stripped from the school names. For example, the following terms will be removed from school names if found:

Inc.
Incorporated
Corp.

Normalizer Output

Whenever a call to normalize a resume is performed, the results will be displayed in the UserArea section of the entity that is being normalized (e.g. company name, school name, etc.). The following XML section displays a normalized company name (from “Cisco Systems, Inc” to “Cisco”):

<EmployerOrg>
    <EmployerOrgName>Cisco Systems, Inc</EmployerOrgName>
    <PositionHistory positionType="directHire" currentEmployer="true">
        <Title>Dir of Web Applications Development</Title>
        <OrgName>
            <OrganizationName>Cisco Systems, Inc</OrganizationName>
        </OrgName>
        ...		
    </PositionHistory>
    <UserArea>
        <sov:EmployerOrgUserArea>
            <sov:NormalizedEmployerOrgName>Cisco</sov:NormalizedEmployerOrgName>
        </sov:EmployerOrgUserArea>
    </UserArea>
</EmployerOrg>    
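
A minimal sketch of reading both the raw and the normalized name from such a fragment follows. The fragment above omits namespace declarations, so both namespace URIs in the sketch are placeholders (assumptions); use the URIs declared in your actual output.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions), bound to prefixes for the queries.
NS = {"hr": "urn:example:resume", "sov": "urn:example:sovren"}

xml_text = f"""<EmployerOrg xmlns="{NS['hr']}" xmlns:sov="{NS['sov']}">
    <EmployerOrgName>Cisco Systems, Inc</EmployerOrgName>
    <UserArea>
        <sov:EmployerOrgUserArea>
            <sov:NormalizedEmployerOrgName>Cisco</sov:NormalizedEmployerOrgName>
        </sov:EmployerOrgUserArea>
    </UserArea>
</EmployerOrg>"""


def employer_names(xml_text):
    """Return (raw name, normalized name), falling back to the raw name."""
    org = ET.fromstring(xml_text)
    raw = org.findtext("hr:EmployerOrgName", namespaces=NS)
    normalized = org.findtext(
        "hr:UserArea/sov:EmployerOrgUserArea/sov:NormalizedEmployerOrgName",
        namespaces=NS,
    )
    return raw, normalized or raw


print(employer_names(xml_text))  # ('Cisco Systems, Inc', 'Cisco')
```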

Normalizer Data Editor App Walkthrough

In this example, we will normalize data for a specific company name and apply it through the SaaS API:

  1. Download the Sovren Normalizer Editor package available through My Sovren Portal under the Tools section in the Developer Center.
  2. Unzip the Sovren.Normalizer.Editor.zip file to a local directory.
  3. Open the Sovren.Normalizer.Editor.exe app and accept the disclaimer.
  4. Click File -> New and select the SN.COMPANY.ALL.txt. (click on "See file descriptions" to view the function of each list)

  5. In this case, we will be normalizing a variety of possible company names for "Cisco Systems" to the normalized term "CISCO". To begin, start typing in the following terms. All the terms in the Term to Normalize column will be normalized to the Normalized Term column value. (NOTE: you can also import pre-made lists into the Editor.)

  6. Click File -> SaaS Normalizer Data... and type in your Account ID and Service Key in the form:

  7. Click Save As... and enter "mynormalizerlist" in the Name field and "English (en)" in the Language field:

  8. Click Save and you should receive a message that the normalization data was successfully posted to your SaaS account.

Document Conversion Result Codes

This document lists the set of result codes reported by the Sovren components and sample applications as a result of converting the original document (DOC, PDF, etc.) to another format (e.g. plain text for resume parsing). The result codes identify success, warning or error.

These response codes are used in the following API endpoints:

Success SubCodes

This code indicates that the document converted successfully and that no warnings were detected.

Code Description
ovIsProbablyValid Output did not fail any tests.

Warning SubCodes

These codes indicate that a problem might exist. In some cases these warnings should be heeded, and in other cases these warnings should be ignored. The context matters. For example, the ovHighAsciiExceeds10Percent warning is not a concern when the document is a Japanese resume, but it almost certainly indicates a problem if you are processing (or expecting) an English resume.

Generally, unless the code indicates corruption, you should go ahead and try to parse the converted text. For instance, we've seen valid resumes that return ovAvgWordLengthLessThan4 because there was a list of dozens of two-letter state codes embedded in the text. However, in an online processing scenario, you may want to first ask the candidate to review the converted text to ensure that it was not corrupted in some way.

Code Description
ovProbableGarbageInText 5% or more of the characters in the text are "symbols" (like the copyright sign and the trademark symbol) rather than regular letters and digits and punctuation. This is usually a sign that there is garbage in the converted text.
ovTooFewLineBreaks There are no line breaks, or there are far fewer line breaks than there should be, given the length of the text. Specifically, there are fewer than 10 line breaks and the average line length is greater than 500 characters.
ovUnknown Validity could not be determined. Occurs if the converted text is empty for some unknown reason.

Error SubCodes

These codes indicate a definite problem. When one is returned, the conversion output should be ignored.

Code                                Description
ovConfigurationError                An installation/deployment/configuration error is preventing one or more modules from working properly. As a result, the converted text may be worse than expected.
ovCorrupt                           The input document was recognized but is either corrupt or cannot be converted because the details of the file cannot be understood.
ovCouldNotLoadFile                  The specified file was found but could not be read.
ovErrorOnOutputToHtml               An internal exception occurred while converting to HTML. Enable a trace listener for the TraceSource named "DocumentConverter" to capture details about the error.
ovErrorOnOutputToRtf                An internal exception occurred while converting to RTF. Enable a trace listener for the TraceSource named "DocumentConverter" to capture details about the error.
ovErrorOnOutputToText               An internal exception occurred while converting to TEXT. Enable a trace listener for the TraceSource named "DocumentConverter" to capture details about the error.
ovErrorOnOutputToXml                An internal exception occurred while converting to WordXML. Enable a trace listener for the TraceSource named "DocumentConverter" to capture details about the error.
ovFileNotFound                      The disk file that was to be converted could not be found.
ovIsEncrypted                       The document is encrypted and cannot be opened without a password.
ovIsImage                           The document is just an image file. You will need to OCR the file to extract the text.
ovNullInput                         A null parameter, an empty byte array, or an empty file was passed to the converter.
ovTimeout                           The timeout was reached during conversion.
ovUnsupportedFormat                 The requested input/output format combination cannot be converted by the converters currently enabled through the EnabledModules property.
ovWordConvErrorAndProbableProblems  An error occurred while converting the document, and the output probably contains invalid text or is truncated.
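
The distinction between the two groups of codes can be sketched as a small lookup. This is only an illustration in Java: the class and method names are hypothetical, and it assumes your code has already extracted the subcode from the conversion result as a plain string (how you obtain it depends on the API you call and is not shown here).

```java
import java.util.Set;

final class ConversionCodes {
    // Error SubCodes copied from the table above; output should be ignored for these.
    private static final Set<String> ERROR_SUBCODES = Set.of(
            "ovConfigurationError", "ovCorrupt", "ovCouldNotLoadFile",
            "ovErrorOnOutputToHtml", "ovErrorOnOutputToRtf", "ovErrorOnOutputToText",
            "ovErrorOnOutputToXml", "ovFileNotFound", "ovIsEncrypted", "ovIsImage",
            "ovNullInput", "ovTimeout", "ovUnsupportedFormat",
            "ovWordConvErrorAndProbableProblems");

    /** Returns true when the subcode is a listed error, so the converted text should be discarded. */
    static boolean shouldDiscardOutput(String subCode) {
        // Set.of rejects null lookups, so guard against a missing subcode first.
        return subCode != null && ERROR_SUBCODES.contains(subCode);
    }
}
```

Any code for which shouldDiscardOutput returns false is treated as a warning at most, in which case the converted text can still be parsed, possibly after asking the candidate to review it.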

Consuming XML Results

XPath

Developers who are not familiar with XML namespaces may be surprised when the following XPath statement returns null:

    /Resume/StructuredXMLResume/ContactInfo/PersonName/GivenName

The reason is that the XPath statement requests elements that are in no namespace, whereas every element in the parser output belongs to a namespace. For the XPath to work correctly, you can either remove the default namespace or specify the namespaces in your XPath.

Using Namespaces

In your XPath API, associate each namespace with a prefix and then use those prefixes in your XPath statements.

C# Example

using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

// "xml" holds the XML string returned by the parser.
XPathDocument doc = new XPathDocument(new StringReader(xml));
XPathNavigator nav = doc.CreateNavigator();
// Bind a prefix to each namespace so that the XPath statements can reference it.
XmlNamespaceManager manager = new XmlNamespaceManager(nav.NameTable);
manager.AddNamespace("hr", "http://ns.hr-xml.org/2006-02-28");
manager.AddNamespace("sov", "http://sovren.com/hr-xml/2006-02-28");
// SelectSingleNode returns null when no node matches.
XPathNavigator givenNameNode = nav.SelectSingleNode("/hr:Resume/hr:StructuredXMLResume/hr:ContactInfo/hr:PersonName/hr:GivenName", manager);
string givenName = (givenNameNode == null) ? String.Empty : givenNameNode.Value;
XPathNavigator languageNode = nav.SelectSingleNode("/hr:Resume/hr:UserArea/sov:ResumeUserArea/sov:Culture/sov:Language", manager);
string language = (languageNode == null) ? String.Empty : languageNode.Value;

Java Example

See examples for various Java APIs at http://www.edankert.com/defaultnamespaces.html
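
As a minimal sketch of the same idea with the standard JAXP API (Java 9 or later), the following binds the "hr" and "sov" prefixes through a NamespaceContext. The class name SovrenXPath and the helper valueOf are hypothetical names chosen for this illustration. The namespace URIs and XPath statements are the ones used in the C# example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SovrenXPath {
    // The prefixes are local to this code; only the namespace URIs must match the XML.
    private static final Map<String, String> NAMESPACES = Map.of(
            "hr", "http://ns.hr-xml.org/2006-02-28",
            "sov", "http://sovren.com/hr-xml/2006-02-28");

    private static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
        }
        // Reverse lookups are not needed to evaluate XPath statements.
        @Override public String getPrefix(String uri) { return null; }
        @Override public Iterator<String> getPrefixes(String uri) {
            return Collections.emptyIterator();
        }
    };

    /** Returns the string value of the first matching node, or "" when nothing matches. */
    static String valueOf(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default; prefixes never match without it
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath: " + expression, e);
        }
    }
}
```

For example, SovrenXPath.valueOf(xml, "/hr:Resume/hr:StructuredXMLResume/hr:ContactInfo/hr:PersonName/hr:GivenName") returns the given name, while the same path without prefixes returns an empty string because it matches nothing.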

Removing Default Namespace

If you are unfamiliar with specifying namespaces in XPath, the simplest approach may be to remove the namespaces before loading the XML. The following C# code demonstrates how to remove the two primary namespaces in Sovren XML:

// Drop the default namespace declaration and the "sov:" prefix on element tags.
xml = xml.Replace("xmlns=\"http://ns.hr-xml.org/2006-02-28\"", "");
xml = xml.Replace("<sov:", "<");
xml = xml.Replace("</sov:", "</");
XPathDocument doc = new XPathDocument(new StringReader(xml));
XPathNavigator nav = doc.CreateNavigator();
// SelectSingleNode returns null when no node matches.
XPathNavigator givenNameNode = nav.SelectSingleNode("/Resume/StructuredXMLResume/ContactInfo/PersonName/GivenName");
string givenName = (givenNameNode == null) ? String.Empty : givenNameNode.Value;
XPathNavigator languageNode = nav.SelectSingleNode("/Resume/UserArea/ResumeUserArea/Culture/Language");
string language = (languageNode == null) ? String.Empty : languageNode.Value;

After removing the namespaces, your XPath statements will be simple and will not require any namespace definitions or prefixes.
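
For Java users, the same string replacements might look like the sketch below. NamespaceStripper and its methods are hypothetical names for this illustration. Because the replacements are purely textual, they assume the XML uses exactly the default namespace declaration and the "sov" prefix shown in this guide.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class NamespaceStripper {
    /** Removes the default namespace declaration and the "sov:" prefix from element tags. */
    static String strip(String xml) {
        return xml.replace("xmlns=\"http://ns.hr-xml.org/2006-02-28\"", "")
                  .replace("<sov:", "<")
                  .replace("</sov:", "</");
    }

    /** Evaluates a prefix-free XPath statement against the stripped XML; "" when nothing matches. */
    static String valueOf(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(strip(xml))));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath: " + expression, e);
        }
    }
}
```

With this helper, NamespaceStripper.valueOf(xml, "/Resume/UserArea/ResumeUserArea/Culture/Language") needs no NamespaceContext at all.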

XML Namespaces

All elements in the HROpenStandards.org Resume standard belong to the namespace:

    http://ns.hr-xml.org/2006-02-28

The XML file specifies the default namespace via the “xmlns=” attribute:

<?xml version="1.0" encoding="UTF-8"?>
    <Resume xml:lang="en"
        xmlns="http://ns.hr-xml.org/2006-02-28"
        xmlns:sov="http://sovren.com/hr-xml/2006-02-28">
        ...

This means that every XML element that does not have a namespace prefix or its own xmlns attribute belongs to that default namespace. This is normal syntax and is understood by all conformant XML readers.
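
To see the default namespace at work, the following Java sketch parses a document with a namespace-aware parser and reports the root element as "{namespaceURI}localName". DefaultNamespaceDemo and qualifiedRootName are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class DefaultNamespaceDemo {
    /** Parses the XML and returns the root element formatted as "{namespaceURI}localName". */
    static String qualifiedRootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Although the Resume element carries no prefix, the reported namespace URI is http://ns.hr-xml.org/2006-02-28, which is why an XPath statement without namespace bindings does not match it.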

Tuning Tips

Reduce Roundtrips

Some HTTP clients automatically add the HTTP request header "Expect: 100-Continue" to POST requests. The client then sends only the headers, waits for a positive response from the server, and only then sends the body of the request. This results in two full roundtrips to the server for every request. If your roundtrip time (e.g. ping time) is 100 ms, then you're adding 100 ms to every request for no benefit. We recommend that you disable the "Expect: 100-Continue" behavior for your SOAP/HTTP connection. In .NET environments, set ServicePointManager.Expect100Continue = false.
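
For Java 11 or later, a request built with the standard java.net.http API can state this explicitly, as in the sketch below. ParseRequestFactory is a hypothetical name, and the endpoint, content type, and body are placeholders supplied by the caller, not actual Sovren values. The java.net.http client already leaves this behavior disabled by default, so the call mainly documents intent and protects against later changes.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class ParseRequestFactory {
    /** Builds a POST request that sends headers and body together, without "Expect: 100-Continue". */
    static HttpRequest build(URI endpoint, String contentType, String body) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", contentType)
                .expectContinue(false) // no header-only roundtrip before the body is sent
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }
}
```

Other HTTP client libraries expose the setting differently, so check the documentation of the client you use.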

SOAP Service WSDL

Don't request the WSDL for each transaction. Instead, set up your proxy classes once, and reuse those classes for each transaction.