Friday, June 5, 2009

Convert Stock CUSIPs to Stock Symbols with a MATLAB program

UPDATE: Thanks to a comment, I changed the code to reflect the recent changes at the Fidelity site.

A CUSIP is a 9-character alphanumeric identifier for a security. The name stands for "Committee on Uniform Security Identification Procedures". It is often useful to look up the stock symbol that a CUSIP represents. I had a list of CUSIPs with some associated data, but not the corresponding stock symbols, and I was only interested in stock CUSIPs. I searched online and could not find an automated way to get the stock symbol for a stock CUSIP, so I wrote the following program, which looks up each CUSIP on the Fidelity website and grabs the associated stock symbol. It relies heavily on regular expressions. I hope this program will be useful to others.

For example, the CUSIP '031162100' corresponds to Amgen Inc (AMGN):
Symbols = CusipToSymbolLookUp({'031162100';'03116210';})
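
The second entry in that call has only 8 characters, so a check digit has to be appended before the lookup. As a rough illustration of the arithmetic involved, here is a minimal sketch for that one CUSIP. It assumes the input contains digits only (letters would first need to be mapped to 10 through 35), and the variable names are mine, not part of the program below.

```matlab
% Worked check-digit computation for the 8-character CUSIP '03116210'.
% Assumption: digits only; letters are not handled in this sketch.
cusip8  = '03116210';
vals    = cusip8 - '0';                 % [0 3 1 1 6 2 1 0]
weights = [1 2 1 2 1 2 1 2];            % double every second character
scaled  = vals .* weights;              % [0 6 1 2 6 4 1 0]

% Replace any two-digit product by the sum of its digits
scaled = floor(scaled/10) + rem(scaled,10);

% The check digit is the tens complement of the last digit of the sum
checkDigit = mod(10 - mod(sum(scaled),10), 10);   % sum is 20, so this is 0

% Append the check digit to get the full 9-character CUSIP
cusip9 = [cusip8 num2str(checkDigit)];            % '031162100'
```

The result matches the 9-character Amgen CUSIP given above.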

function Symbols = CusipToSymbolLookUp(Cusip)
% Symbols = CusipToSymbolLookUp(Cusip) gives a list of STOCK symbols that
% correspond to a list of Cusips.
% This function looks up a STOCK symbol for a given CUSIP
% The CUSIP needs to be 8 or 9 characters long
% If a valid 8 character Cusip is given, then a 9th check digit is added if
% possible. It accesses the Fidelity website and does HTML parsing
% to get the symbol name. Cusip can be a cell array of Cusips
% More information on CUSIPs can be found at:
% I thank Nabeel Azar for his program checkcusip.m
% Example:
% Symbols = CusipToSymbolLookUp({'031162100';'03116210';})

% At least one input is required
if nargin < 1
    error('At least one input is needed')
end

% Check that the input is either a character array or a cell array
if ~ischar(Cusip) && ~iscell(Cusip)
    error('Cusip needs to be either a character array or a cell array')
end

% A cell array input must be a vector
if iscell(Cusip) && ~isvector(Cusip)
    error('Cusip needs to be a cell array vector')
end

% Convert Char to a cell string
Cusip = cellstr(Cusip);

% Find how many cusips were given
ncusips = length(Cusip);

% Initial web URL
weburl = [''];

% Pre assign the Output
Symbols = cell(ncusips,1);

% Now go through the List and do the processing
for idx = 1:ncusips

    % If the CUSIP is 8 characters long, it is extended to 9 characters
    % using checkcusip. If checkcusip returns a logical false, the CUSIP
    % is invalid. If it returns a double, that value is the check digit
    % and is joined to the original CUSIP.
    if length(Cusip{idx}) == 8
        Result = checkcusip(Cusip(idx));
        if islogical(Result{1}) && Result{1} == false
            % Invalid CUSIP: leave the symbol empty
            continue
        end
        % Join the CUSIP with the check digit
        Cusip{idx} = [Cusip{idx} num2str(Result{:})];
    end

    % If the length is not equal to 9, then just continue
    if length(Cusip{idx}) ~= 9
        continue
    end

    % Construct the URL using the CUSIP and read the URL
    %[weburl Cusip{idx} '&submit=Search']
    data = urlread([weburl Cusip{idx} '&submit=Search']);

    % Search for a preliminary pattern
    %pat = '<(a HREF).*?>.*?';
    pat = 'SID_VALUE_ID=([a-zA-Z]+)">';

    % Use regexp to match the pattern and capture the symbol
    data2 = regexp(data, pat, 'tokens');

    % Use another pattern to get the symbol
    % pat = '>\w*<';
    % sname=char(regexp(data2{1},pat,'match'));

    % Store the first match; the symbol stays empty if nothing matched
    if ~isempty(data2)
        Symbols(idx) = data2{1};
    end

    % % Pre-Process the Output before storing it in an array
    % if(~isempty(sname))
    % sname([1 end])='';
    % Symbols{idx} = sname;
    % end
end


function Result = checkcusip(inputCell)
% CHECKCUSIP is used to validate 9 digit CUSIPs and
% provide the checkdigit for 8 digit CUSIPs.
% Note that if you give this function a combination of
% 8 and 9 digit CUSIPs, you need to check both the class
% (logical or non-logical) as well as the value of the
% output. Logicals are used to indicate validity of a
% 9 digit CUSIP, while non-logical doubles are used to
% supply the checkdigit for an 8 digit CUSIP.

% Convert the input to a char array
if iscell(inputCell)
    cusipCharArray = strvcat(inputCell{:});
else
    error('Inputs must be cell arrays of CUSIP strings.')
end

% Make sure there are at least 8 columns
if size(cusipCharArray,2)<8
    error('Must supply 8 or 9 digit CUSIPs.');
end

% Make them all lowercase
cusipCharArray = lower(cusipCharArray);

% Convert the string digits to numerical values and
% the characters to their numerical values, with 'A':=10
% Set spaces (for computing the checkdigit) to NaNs

% Transpose the array, and work down the columns.
longCusipString = double(cusipCharArray);
longCusipString = longCusipString.';
numericalLocations = (longCusipString>='0' & longCusipString<='9');
charLocations = (longCusipString>='a' & longCusipString<='z');
NaNLocations = longCusipString==' ';

longCusipString(numericalLocations) = longCusipString(numericalLocations) - '0';
longCusipString(charLocations) = longCusipString(charLocations) - 'a' + 10;
longCusipString(NaNLocations) = NaN;

% Get the cusip digits
cusipNums = longCusipString(1:8,:);

% Scale with scaling factors
cusipNums = diag([1 2 1 2 1 2 1 2]) * cusipNums;

% Sum the digits in each term >=10;
gt_10 = cusipNums>=10;
cusipNums(gt_10) = floor(cusipNums(gt_10)/10) + rem(cusipNums(gt_10),10);

% Sum the resulting values
cusipNums = sum(cusipNums);

% Get the last digit
lastDigit = rem(cusipNums,10);

% Generate the checkdigit
checkDigit = 10 - lastDigit;
checkDigit(checkDigit==10) = 0;


% Create a cell array the right size for the output.
Result = cell(numel(inputCell),1);

% If no check digit was given in the input, output the checkdigit
if size(longCusipString,1)==9
    % Entries padded with a space in the 9th position still need a check digit
    needCusip = isnan(longCusipString(9,:));
else
    % All inputs are 8 characters long, so every one needs a check digit
    needCusip = true(1,size(longCusipString,2));
end
Result(needCusip) = num2cell(checkDigit(needCusip));

% If a check digit was given, validate it
if size(longCusipString,1)==9
    isCheckdigitCorrect = longCusipString(9,~needCusip)==checkDigit(~needCusip);
    Result(~needCusip) = num2cell(isCheckdigitCorrect);
end

% Only the 1st, 4th, 5th, or 6th digit may be an alphanumeric letter
% (1st for international issues)
badIdx = any(longCusipString([2 3 7 8],:)>=10,1);
Result(badIdx) = {logical(0)};

% The first digit cannot be an i, o, or z
% (test the original characters, because longCusipString now holds
% numeric values rather than character codes)
badIdx = ismember(cusipCharArray(:,1).', 'ioz');
Result(badIdx) = {logical(0)};

% Reshape the result
Result = reshape(Result,size(inputCell));
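
As the help text of checkcusip notes, its output mixes logicals (validity of a 9-character CUSIP) with doubles (the check digit for an 8-character CUSIP), so the class has to be tested before the value. The sketch below shows why: a check digit of 0 compares equal to false unless you test islogical first. The Result cell here is hand-written, hypothetical data in the shape checkcusip returns, not actual output.

```matlab
% Hypothetical results, one cell per CUSIP:
% a double check digit, a valid flag, and an invalid flag
Result = {0; true; false};

labels = cell(size(Result));
for k = 1:numel(Result)
    r = Result{k};
    if islogical(r)
        % Logical: validity of a CUSIP
        if r
            labels{k} = 'valid 9-character CUSIP';
        else
            labels{k} = 'invalid CUSIP';
        end
    else
        % Double: the check digit to append to an 8-character CUSIP
        labels{k} = ['check digit ' num2str(r)];
    end
end
```

Testing only `Result{1}==false` would wrongly flag the first entry as invalid, since the double 0 equals false in value.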

1 comment:

aviolov said...

Thanks, the fidelity format page has changed a bit since this was posted:

Here are the most relevant changes I needed to make it work again:
% Initial Web URL
weburl = [''];


% Construct the URL using the Cusip and read the url
weburl_detailed = [weburl Cusip{idx} '&submit=Search'];
data = urlread(weburl_detailed);

% Search for a preliminary pattern
pat = 'SID_VALUE_ID=([a-zA-Z]+)">';

% Use regexp to match the pattern
data2 = regexp(data, pat, 'tokens');
if ~isempty(data2)
    Symbols(idx) = data2{1};
end
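
To see what the SID_VALUE_ID pattern captures, here is a minimal sketch that runs it against a made-up HTML fragment. The sampleHtml string is an assumption for illustration only; it is not copied from the Fidelity page, whose actual markup may differ.

```matlab
% Hypothetical HTML fragment containing the marker the pattern looks for
sampleHtml = '<a href="quote?SID_VALUE_ID=AMGN">Amgen Inc</a>';

% Same pattern as in the program: capture the letters after SID_VALUE_ID=
pat = 'SID_VALUE_ID=([a-zA-Z]+)">';

% With the 'tokens' option, regexp returns a cell array with one cell
% per match, each holding the captured groups of that match
tokens = regexp(sampleHtml, pat, 'tokens');

if ~isempty(tokens)
    symbol = tokens{1}{1};   % first match, first captured group
else
    symbol = '';             % no match, e.g. an unknown CUSIP
end
```

Because the character class is [a-zA-Z]+, a symbol containing a digit or a dot would not be captured by this pattern as written.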